int, or integer: a number without a fractional part.
float, or floating point: a number that has both an integer and fractional part, separated by a point.
factor, with the value 1.10, is an example of a float.
str, or string: a type to represent text.
You can use single or double quotes to build a string.
bool, or boolean: a type to represent logical values. Can only be True or False (the capitalization is important!).
type() function
To determine the type of a, simply execute:
type(a)
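For example, assuming a holds the float mentioned above:

```python
a = 1.10
print(type(a))        # → <class 'float'>
print(type('hello'))  # → <class 'str'>
print(type(True))     # → <class 'bool'>
```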
Using the + operator to paste together two strings.
print("I started with $" + savings + " and now have $" + result + ". Awesome!")
This will not work, though, as you cannot simply sum strings and floats.
To fix the error, you'll need to explicitly convert the types of your variables. More specifically, you'll need str(), to convert a value into a string. str(savings), for example, will convert the float savings to a string.
Similar functions such as int(), float() and bool() will help you convert Python values into any type.
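A minimal sketch of the fixed line, using illustrative values for savings and result (the actual values come from the exercise):

```python
savings = 100.0
result = 110.0
# str() converts each float so the + concatenation works
message = "I started with $" + str(savings) + " and now have $" + str(result) + ". Awesome!"
print(message)
```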
Manipulating lists
.append() method
.extend() method on the list
.index() method
.pop() method
aList = [123, 'xyz', 'zara', 'abc']
aList.append(2009)
print("Updated List : ", aList)
# Create a list containing the names: baby_names
baby_names = ['Ximena', 'Aliza', 'Ayden', 'Calvin']
# Extend baby_names with 'Rowen' and 'Sandeep'
baby_names.extend(['Rowen', 'Sandeep'])
# Print baby_names
print(baby_names)
# Find the position of 'Aliza': position
position = baby_names.index('Aliza')
# Remove 'Aliza' from baby_names
baby_names.pop(position)
# Print baby_names
print(baby_names)
Looping over lists
for loop
sorted()
The sorted() function returns a new list and does not affect the list you passed into the function.
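A quick demonstration that sorted() returns a new list and leaves the original untouched:

```python
names = ['Calvin', 'Aliza', 'Ximena']
ordered = sorted(names)
print(ordered)  # → ['Aliza', 'Calvin', 'Ximena']
print(names)    # → ['Calvin', 'Aliza', 'Ximena'] (unchanged)
```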
Given a list of lists where each record has this form:
['2011', 'FEMALE', 'HISPANIC', 'GERALDINE', '13', '75']
loop over the list of lists and append the name (index 3) of each baby to a new list called baby_names.
baby_names = []

for x in range(0, 3):
    print("We're on time %d" % (x))

fruits = ['banana', 'apple', 'mango']
for fruit in fruits:
    print('Current fruit :', fruit)
Extracting an item from each sublist in Python:
# Create the empty list: baby_names
baby_names = []
# Loop over records
for baby in records:
    # Add the name to the list
    baby_names.append(baby[3])
# Sort the names in alphabetical order
for name in sorted(baby_names):
    # Print each name
    print(name)
Tuples are fixed size in nature whereas lists are dynamic.
In other words, a tuple is immutable whereas a list is mutable.
Using and unpacking tuples
Tuples are made of several items just like a list, but they cannot be modified in any way.
It is very common for tuples to be used to represent data from a database.
If you have a tuple like ('chocolate chip cookies', 15) and you want to access each part of the data,
you can use an index just like a list.
However, you can also "unpack" the tuple into multiple variables such as
type, count = ('chocolate chip cookies', 15)
that will set type to 'chocolate chip cookies' and count to 15.
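Both access styles in one sketch (kind is used instead of type to avoid shadowing the built-in):

```python
item = ('chocolate chip cookies', 15)
# index access, just like a list
print(item[0])  # → chocolate chip cookies
# unpacking into two variables
kind, count = item
print(kind, count)
```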
Often you'll want to pair up multiple lists. The zip() function does just that: it returns tuples containing one element from each iterable passed into zip() (a list of tuples in Python 2; a lazy iterator in Python 3).
When looping over a list, you can also track your position in the list by using the enumerate() function. The function returns the index of the list item you are currently on in the list and the list item itself.
You'll practice using the enumerate() and zip() functions in this exercise, in which your job is to pair up the most common boy and girl names. Two lists - girl_names and boy_names - have been pre-loaded into your workspace.
Instructions
Use the zip() function to pair up girl_names and boy_names into a variable called pairs.
Use a for loop to loop through pairs, using enumerate() to keep track of your position. Unpack pairs into the variables idx and pair.
Inside the for loop:
Unpack pair into the variables girl_name and boy_name.
Print the rank, girl name, and boy name, in that order. The rank is contained in idx.
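A sketch of the solution; the names below are made-up samples, not the pre-loaded course lists:

```python
girl_names = ['Emma', 'Olivia']
boy_names = ['Noah', 'Liam']
# Pair up the two lists
pairs = list(zip(girl_names, boy_names))
# Track position with enumerate, then unpack each pair
for idx, pair in enumerate(pairs):
    girl_name, boy_name = pair
    print('Rank {}: {} and {}'.format(idx, girl_name, boy_name))
```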
==========================
get webpage contents with python
Urllib - GET Requests
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
import urllib2
resp = urllib2.urlopen('http://hiscore.runescape.com/index_lite.ws?player=zezima')
page = resp.read()
Using Python 3:
import urllib.request
response = urllib.request.urlopen('http://www.python.org/')
data-types-for-data-science
==========
import
import sys
print(sys.path)
import sys
for pth in sys.path:
    print(pth)
import os
os.getcwd()
os.chdir("/tmp/")
os.getcwd()
In Python, import, import as, and from ... import can appear anywhere in a program where a statement can appear.
import xmath
print(xmath.max(10, 5))
print(xmath.sum(1, 2, 3, 4, 5))
import xmath as math  # alias the xmath module as math
print(math.e)
from xmath import min  # copies min into the current module; `from modu import *` is discouraged, as it easily causes name clashes
print(min(10, 5))
==========
import os
os.system("your command")
print("\n")
import os
os.system("dir c:\\")
==========
print without a newline or space
In Python 3, to print without a trailing newline or space:
print(str, end='')
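A quick check, capturing stdout with io.StringIO to show that end='' suppresses the newlines:

```python
import io
from contextlib import redirect_stdout

buf = io.StringIO()
with redirect_stdout(buf):
    for ch in 'abc':
        print(ch, end='')
# All three characters landed on one line, with no trailing newline
print(repr(buf.getvalue()))  # → 'abc'
```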
Samples
max1 = a if a > b else b

def max(a, b):
    return a if a > b else b
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)
Please note that these examples are written in Python 2, and may need some adjustment to run under Python 3.
1 line: Output
print('Hello, world!')
2 lines: Input, assignment
name = raw_input('What is your name?\n')
print('Hi, %s.' % name)
3 lines: For loop, built-in enumerate function, new style formatting
friends = ['john', 'pat', 'gary', 'michael']
for i, name in enumerate(friends):
    print("iteration {iteration} is {name}".format(iteration=i, name=name))
4 lines: Fibonacci, tuple assignment
parents, babies = (1, 1)
while babies < 100:
    print('This generation has {0} babies'.format(babies))
    parents, babies = (babies, parents + babies)
5 lines: Functions
def greet(name):
    print('Hello', name)

greet('Jack')
greet('Jill')
greet('Bob')
6 lines: Import, regular expressions
import re
for test_string in ['555-1212', 'ILL-EGAL']:
    if re.match(r'^\d{3}-\d{4}$', test_string):
        print(test_string, 'is a valid US local phone number')
    else:
        print(test_string, 'rejected')
7 lines: Dictionaries, generator expressions
prices = {'apple': 0.40, 'banana': 0.50}
my_purchase = {
    'apple': 1,
    'banana': 6}
grocery_bill = sum(prices[fruit] * my_purchase[fruit]
                   for fruit in my_purchase)
print('I owe the grocer $%.2f' % grocery_bill)
8 lines: Command line arguments, exception handling
# This program adds up integers in the command line
import sys
try:
    total = sum(int(arg) for arg in sys.argv[1:])
    print('sum =', total)
except ValueError:
    print('Please supply integer arguments')
9 lines: Opening files
# indent your Python code to put into an email
import glob
# glob supports Unix style pathname extensions
python_files = glob.glob('*.py')
for file_name in sorted(python_files):
    print('    ------' + file_name)
    with open(file_name) as f:
        for line in f:
            print('    ' + line.rstrip())
    print()
10 lines: Time, conditionals, from..import, for..else
from time import localtime

activities = {8: 'Sleeping',
              9: 'Commuting',
              17: 'Working',
              18: 'Commuting',
              20: 'Eating',
              22: 'Resting'}

time_now = localtime()
hour = time_now.tm_hour

for activity_time in sorted(activities.keys()):
    if hour < activity_time:
        print(activities[activity_time])
        break
else:
    print('Unknown, AFK or sleeping!')
11 lines: Triple-quoted strings, while loop
REFRAIN = '''
%d bottles of beer on the wall,
%d bottles of beer,
take one down, pass it around,
%d bottles of beer on the wall!
'''
bottles_of_beer = 99
while bottles_of_beer > 1:
    print(REFRAIN % (bottles_of_beer, bottles_of_beer,
                     bottles_of_beer - 1))
    bottles_of_beer -= 1
12 lines: Classes
class BankAccount(object):
    def __init__(self, initial_balance=0):
        self.balance = initial_balance
    def deposit(self, amount):
        self.balance += amount
    def withdraw(self, amount):
        self.balance -= amount
    def overdrawn(self):
        return self.balance < 0

my_account = BankAccount(15)
my_account.withdraw(5)
print(my_account.balance)
13 lines: Unit testing with unittest
import unittest
def median(pool):
    copy = sorted(pool)
    size = len(copy)
    if size % 2 == 1:
        return copy[(size - 1) / 2]
    else:
        return (copy[size/2 - 1] + copy[size/2]) / 2
class TestMedian(unittest.TestCase):
    def testMedian(self):
        self.failUnlessEqual(median([2, 9, 9, 7, 9, 2, 4, 5, 8]), 7)
if __name__ == '__main__':
    unittest.main()
14 lines: Doctest-based testing
def median(pool):
    '''Statistical median to demonstrate doctest.
    >>> median([2, 9, 9, 7, 9, 2, 4, 5, 8])
    7
    '''
    copy = sorted(pool)
    size = len(copy)
    if size % 2 == 1:
        return copy[(size - 1) / 2]
    else:
        return (copy[size/2 - 1] + copy[size/2]) / 2
if __name__ == '__main__':
    import doctest
    doctest.testmod()
15 lines: itertools
from itertools import groupby
lines = '''
This is the
first paragraph.

This is the second.
'''.splitlines()
# Use itertools.groupby and bool to return groups of
# consecutive lines that either have content or don't.
for has_chars, frags in groupby(lines, bool):
    if has_chars:
        print(' '.join(frags))
# PRINTS:
# This is the first paragraph.
# This is the second.
16 lines: csv module, tuple unpacking, cmp() built-in
import csv
# write stocks data as comma-separated values
writer = csv.writer(open('stocks.csv', 'wb', buffering=0))
writer.writerows([
    ('GOOG', 'Google, Inc.', 505.24, 0.47, 0.09),
    ('YHOO', 'Yahoo! Inc.', 27.38, 0.33, 1.22),
    ('CNET', 'CNET Networks, Inc.', 8.62, -0.13, -1.49)
])
# read stocks data, print status messages
stocks = csv.reader(open('stocks.csv', 'rb'))
status_labels = {-1: 'down', 0: 'unchanged', 1: 'up'}
for ticker, name, price, change, pct in stocks:
    status = status_labels[cmp(float(change), 0.0)]
    print('%s is %s (%s%%)' % (name, status, pct))
18 lines: 8-Queens Problem (recursion)
BOARD_SIZE = 8

def under_attack(col, queens):
    left = right = col
    for r, c in reversed(queens):
        left, right = left - 1, right + 1
        if c in (left, col, right):
            return True
    return False

def solve(n):
    if n == 0:
        return [[]]
    smaller_solutions = solve(n - 1)
    return [solution + [(n, i+1)]
            for i in xrange(BOARD_SIZE)
            for solution in smaller_solutions
            if not under_attack(i+1, solution)]

for answer in solve(BOARD_SIZE):
    print(answer)
20 lines: Prime numbers sieve w/fancy generators
import itertools

def iter_primes():
    # an iterator of all numbers between 2 and +infinity
    numbers = itertools.count(2)
    # generate primes forever
    while True:
        # get the first number from the iterator (always a prime)
        prime = numbers.next()
        yield prime
        # this code iteratively builds up a chain of
        # filters...slightly tricky, but ponder it a bit
        numbers = itertools.ifilter(prime.__rmod__, numbers)

for p in iter_primes():
    if p > 1000:
        break
    print(p)
21 lines: XML/HTML parsing (using Python 2.5 or third-party library)
dinner_recipe = '''<html><body><table>
<tr><th>amt</th><th>unit</th><th>item</th></tr>
<tr><td>24</td><td>slices</td><td>baguette</td></tr>
<tr><td>2+</td><td>tbsp</td><td>olive oil</td></tr>
<tr><td>1</td><td>cup</td><td>tomatoes</td></tr>
<tr><td>1</td><td>jar</td><td>pesto</td></tr>
</table></body></html>'''
# In Python 2.5 or from http://effbot.org/zone/element-index.htm
import xml.etree.ElementTree as etree
tree = etree.fromstring(dinner_recipe)
# For invalid HTML use http://effbot.org/zone/element-soup.htm
# import ElementSoup, StringIO
# tree = ElementSoup.parse(StringIO.StringIO(dinner_recipe))
pantry = set(['olive oil', 'pesto'])
for ingredient in tree.getiterator('tr'):
    amt, unit, item = ingredient
    if item.tag == "td" and item.text not in pantry:
        print("%s: %s %s" % (item.text, amt.text, unit.text))
28 lines: 8-Queens Problem (define your own exceptions)
BOARD_SIZE = 8

class BailOut(Exception):
    pass

def validate(queens):
    left = right = col = queens[-1]
    for r in reversed(queens[:-1]):
        left, right = left-1, right+1
        if r in (left, col, right):
            raise BailOut

def add_queen(queens):
    for i in range(BOARD_SIZE):
        test_queens = queens + [i]
        try:
            validate(test_queens)
            if len(test_queens) == BOARD_SIZE:
                return test_queens
            else:
                return add_queen(test_queens)
        except BailOut:
            pass
    raise BailOut

queens = add_queen([])
print(queens)
print("\n".join(". "*q + "Q " + ". "*(BOARD_SIZE-q-1) for q in queens))
33 lines: "Guess the Number" Game (edited) from http://inventwithpython.com
import random

guesses_made = 0
name = raw_input('Hello! What is your name?\n')
number = random.randint(1, 20)
print('Well, {0}, I am thinking of a number between 1 and 20.'.format(name))
while guesses_made < 6:
    guess = int(raw_input('Take a guess: '))
    guesses_made += 1
    if guess < number:
        print('Your guess is too low.')
    if guess > number:
        print('Your guess is too high.')
    if guess == number:
        break
if guess == number:
    print('Good job, {0}! You guessed my number in {1} guesses!'.format(name, guesses_made))
else:
    print('Nope. The number I was thinking of was {0}'.format(number))
Functions in Python Math Module
List of Functions in Python Math Module
Function: Description
ceil(x): Returns the smallest integer greater than or equal to x
copysign(x, y): Returns x with the sign of y
fabs(x): Returns the absolute value of x
factorial(x): Returns the factorial of x
floor(x): Returns the largest integer less than or equal to x
fmod(x, y): Returns the remainder when x is divided by y
frexp(x): Returns the mantissa and exponent of x as the pair (m, e)
fsum(iterable): Returns an accurate floating point sum of values in the iterable
isfinite(x): Returns True if x is neither an infinity nor a NaN (Not a Number)
isinf(x): Returns True if x is a positive or negative infinity
isnan(x): Returns True if x is a NaN
ldexp(x, i): Returns x * (2**i)
modf(x): Returns the fractional and integer parts of x
trunc(x): Returns the truncated integer value of x
exp(x): Returns e**x
expm1(x): Returns e**x - 1
log(x[, base]): Returns the logarithm of x to the given base (defaults to e)
log1p(x): Returns the natural logarithm of 1+x
log2(x): Returns the base-2 logarithm of x
log10(x): Returns the base-10 logarithm of x
pow(x, y): Returns x raised to the power y
sqrt(x): Returns the square root of x
acos(x): Returns the arc cosine of x
asin(x): Returns the arc sine of x
atan(x): Returns the arc tangent of x
atan2(y, x): Returns atan(y / x), using the signs of both arguments to pick the quadrant
cos(x): Returns the cosine of x
hypot(x, y): Returns the Euclidean norm, sqrt(x*x + y*y)
sin(x): Returns the sine of x
tan(x): Returns the tangent of x
degrees(x): Converts angle x from radians to degrees
radians(x): Converts angle x from degrees to radians
acosh(x): Returns the inverse hyperbolic cosine of x
asinh(x): Returns the inverse hyperbolic sine of x
atanh(x): Returns the inverse hyperbolic tangent of x
cosh(x): Returns the hyperbolic cosine of x
sinh(x): Returns the hyperbolic sine of x
tanh(x): Returns the hyperbolic tangent of x
erf(x): Returns the error function at x
erfc(x): Returns the complementary error function at x
gamma(x): Returns the Gamma function at x
lgamma(x): Returns the natural logarithm of the absolute value of the Gamma function at x
pi: Mathematical constant, the ratio of a circle's circumference to its diameter (3.14159...)
e: Mathematical constant e (2.71828...)
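A few of the functions above in action, as a quick sketch using the standard math module:

```python
import math

print(math.ceil(2.3))         # → 3
print(math.floor(2.7))        # → 2
print(math.trunc(-2.7))       # → -2 (truncates toward zero, unlike floor)
print(math.hypot(3, 4))       # → 5.0
print(math.degrees(math.pi))  # → 180.0
```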
def factorial(n):
    num = 1
    while n >= 1:
        num = num * n
        n = n - 1
    return num
from math import factorial
print(factorial(1000))
def factorial(x):
    result = 1
    for i in xrange(2, x + 1):  # xrange is Python 2; use range in Python 3
        result *= i
    return result

print(factorial(1000))
def factorial(n):
    if n < 2:
        return 1
    return n * factorial(n - 1)
def factorial(n):
    base = 1
    for i in range(n, 0, -1):
        base = base * i
    print(base)
divmod(x, y)
returns the tuple (x // y, x % y), i.e. the quotient and the remainder
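For example:

```python
q, r = divmod(17, 5)
print(q, r)  # → 3 2
# equivalent to computing both operators separately:
print((17 // 5, 17 % 5))  # → (3, 2)
```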
The list() constructor
takes sequence types and converts them to lists.
It can be used to convert a given tuple into a list.
Note: tuples are very similar to lists; the only difference is that the element values of a tuple cannot be changed, and
tuple elements are put between parentheses instead of square brackets.
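A small sketch of the round trip between tuple and list:

```python
t = ('chocolate chip cookies', 15)
lst = list(t)   # convert tuple to list
lst[1] = 20     # lists are mutable, so this works
print(lst)      # → ['chocolate chip cookies', 20]
# converting back is just as easy
print(tuple(lst))  # → ('chocolate chip cookies', 20)
```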
itertools.product()
This tool computes the cartesian product of input iterables.
It is equivalent to nested for-loops.
For example, product(A, B) returns the same as ((x,y) for x in A for y in B).
Sample Code
from itertools import product
print(list(product([1,2,3],repeat = 2)))
[(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)]
print(list(product([1,2,3],[3,4])))
[(1, 3), (1, 4), (2, 3), (2, 4), (3, 3), (3, 4)]
A = [[1,2,3],[3,4,5]]
print(list(product(*A)))
[(1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 3), (3, 4), (3, 5)]
B = [[1,2,3],[3,4,5],[7,8]]
print(list(product(*B)))
[(1, 3, 7), (1, 3, 8), (1, 4, 7), (1, 4, 8), (1, 5, 7), (1, 5, 8), (2, 3, 7), (2, 3, 8), (2, 4, 7), (2, 4, 8), (2, 5, 7), (2, 5, 8), (3, 3, 7), (3, 3, 8), (3, 4, 7), (3, 4, 8), (3, 5, 7), (3, 5, 8)]
How to use Loops in Python
For Loop
computer_brands = ["Apple", "Asus", "Dell", "Samsung"]
for brands in computer_brands:
    print(brands)

numbers = [1, 10, 20, 30, 40, 50]
sum = 0
for number in numbers:
    sum = sum + number
print(sum)

for i in range(1, 10):
    print(i)
Break
To break out from a loop, you can use the keyword "break".
for i in range(1, 10):
    if i == 3:
        break
    print(i)
# will print 1 and 2 only
Continue
The continue statement is used to tell Python to skip the rest of the statements
in the current loop block and to continue to the next iteration of the loop.
for i in range(1, 10):
    if i == 3:
        continue
    print(i)
# will print everything except 3
While Loop
computer_brands = ["Apple", "Asus", "Dell", "Samsung"]
i = 0
while i < len(computer_brands):
    print(computer_brands[i])
    i = i + 1

while True:
    answer = raw_input("Start typing...")
    if answer == "quit":
        break
    print("Your answer was", answer)

counter = 0
while counter <= 100:
    print(counter)
    counter += 2
Nested Loops
for x in range(1, 11):
    for y in range(1, 11):
        print('%d * %d = %d' % (x, y, x*y))
random
import random
a = [1,2,3,4,5,6]
print(a)
random.shuffle(a)
print(a)
items = [1, 2, 3, 4, 5, 6, 7]
random.shuffle(items)
print(items)
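random.shuffle() shuffles in place and returns None; seeding first makes the result reproducible. A quick sketch:

```python
import random

items = [1, 2, 3, 4, 5, 6, 7]
random.seed(42)        # seed only to make the shuffle reproducible
random.shuffle(items)  # shuffles in place and returns None
print(items)
# the same elements are all still present, just reordered
print(sorted(items))   # → [1, 2, 3, 4, 5, 6, 7]
```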
https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/1-2-install/
TensorFlow installation
https://medium.com/@lmoroney_40129/installing-tensorflow-with-gpu-on-windows-10-3309fec55a00
Installing TensorFlow with GPU on Windows 10
pip3 install --upgrade https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.12.0-py3-none-any.whl
The script wheel.exe is installed in 'd:\python36-32\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
import platform
print(platform.python_version())
help('modules')
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()  # TensorFlow 1.x API; removed in TF 2.x
print(sess.run(hello))
Python Simple HTTP server (Python 2; in Python 3 use `python -m http.server`)
import SimpleHTTPServer
import SocketServer
PORT = 8000
Handler = SimpleHTTPServer.SimpleHTTPRequestHandler
httpd = SocketServer.TCPServer(("", PORT), Handler)
print("serving at port", PORT)
httpd.serve_forever()
about pip
C:\Users\User\Desktop>python -m pip install -U pylint --user
'C:\Users\User\AppData\Roaming\Python\Python36\Scripts' which is not on PATH.
run Python from Sublime Text
use SublimeREPL
Tools -> Build System -> (choose) Python then:
To Run: Tools -> Build
-or-
Ctrl+B. This would start your file in the console, which should be at the bottom of the editor.
To Stop: Ctrl+Break or Tools -> Cancel Build
You can find out where your Break key is here: http://en.wikipedia.org/wiki/Break_key.
Note: CTRL + C will NOT work.
What to do when Ctrl + Break does not work:
Go to:
Preferences -> Key Bindings - User
and paste the line below:
{"keys": ["ctrl+shift+c"], "command": "exec", "args": {"kill": true} }
Now, you can use ctrl+shift+c instead of CTRL+BREAK
But when Ctrl+B does not work, Sublime Text probably can't find the Python interpreter.
When trying to run your program, see the log and find the reference to Python in path.
[cmd: [u'python', u'-u', u'C:\\scripts\\test.py']]
[path: ...;C:\Python27 32bit;...]
The point is that it tries to run python via command line, the cmd looks like:
python -u C:\scripts\test.py
If you can't run python from cmd, Sublime Text can't too.
(Try it yourself in cmd, type python in it and run it, python commandline should appear)
SOLUTION
You can either change the Sublime Text build formula or the system %PATH%.
To set your %PATH%:
run Command Line as administrator* and enter this command:
SETX /M PATH "%PATH%;<python_folder>"
for example: SETX /M PATH "%PATH%;C:\Python27;C:\Python27\Scripts"
*You will need to restart your editor to load the new %PATH%.
OR manually (preferable):
Add ;C:\Python27;C:\Python27\Scripts at the end of the PATH string.
'calendar' has no attribute 'month'
AttributeError: module 'calendar' has no attribute 'month'
import calendar
yy = 2018
mm = 11
print(calendar.month(yy, mm))
AttributeError: module 'calendar' has no attribute 'month'
The problem is that you used the name calendar.py for your file.
Use any other name, and you will be able to import the python module calendar.
Reading and Writing Files
file_object = open("filename", "mode")
thisFile = open("thedatafile.txt", "r")
print(thisFile)
file = open("testfile.txt", "w")
file.write("Hello World")
file.close()
There are a number of ways to read a text file in Python, not just one.
file = open("testfile.txt", "r")
print(file.read())
print(file.read(5))  # read the first five characters
readline() reads a file line by line:
print(file.readline())
print(file.readline(3))  # read at most three characters of the current line (not the third line)
file.readlines() returns every line:
print(file.readlines())
for line in file:  # Looping over a file object
    print(line, end='')
Add an EOL character ("\n") when you want to start a new line, since write() does not add one:
file.write("This is a test\n")
file.write("To add more lines.")
file.close()
fh.close() ends things and closes the file completely.
Opening a text file:
fh = open("hello.txt", "r")
Reading a text file:
fh = open("hello.txt", "r")
print(fh.read())
To read a text file one line at a time:
fh = open("hello.txt", "r")
print(fh.readline())
To read a list of lines in a text file:
fh = open("hello.txt", "r")
print(fh.readlines())
To write new content or text to a file:
fh = open("hello.txt", "w")
fh.write("Put the text you want to add here")
fh.write("and more lines if need be.")
fh.close()
To write multiple lines to a file at once:
lines_of_text = ["One line of text here", "and another line here", "and yet another here", "and so on and so forth"]
fh.writelines(lines_of_text)
fh.close()
To append to a file:
fh = open("hello.txt", "a")
fh.write("We Meet Again World")
fh.close()
With Statement
with open("testfile.txt") as file:
    data = file.read()
    # do something with data
with open("testfile.txt") as f:
    for line in f:
        print(line, end='')
The examples above don't call file.close(): the with statement closes the file automatically when the block exits.
with open("hello.txt", "w") as f:
    f.write("Hello World")
To read a file line by line, output into a list:
with open("hello.txt") as f:
    data = f.readlines()
Splitting Lines in a Text File
with open("hello.txt", "r") as f:
    data = f.readlines()
for line in data:
    words = line.split()
    print(words)
# use a colon instead of a space to split
line.split(":")
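A small, self-contained sketch of split() with a colon separator (the sample string is made up):

```python
line = "GOOG:505.24:0.47"
fields = line.split(":")
print(fields)  # → ['GOOG', '505.24', '0.47']
```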
======================
Python's list slice syntax can be used without indices for a few fun and useful things:
# You can clear all elements from a list:
>>> lst = [1, 2, 3, 4, 5]
>>> del lst[:]
>>> lst
[]
# You can replace all elements of a list
# without creating a new list object:
>>> a = lst
>>> lst[:] = [7, 8, 9]
>>> lst
[7, 8, 9]
>>> a
[7, 8, 9]
>>> a is lst
True
# You can also create a (shallow) copy of a list:
>>> b = lst[:]
>>> b
[7, 8, 9]
>>> b is lst
False
======================
CPython easter egg
# Here's a fun little CPython easter egg.
# Just run the following in a Python 2.7+
# interpreter session:
>>> import antigravity
Unicode Error: "unicodeescape" codec can't decode bytes…
The problem is with the string
"C:\Users\Eric\Desktop\beeline.txt"
\U starts an eight-character Unicode escape, such as '\U00014321'.
To fix it, use a raw string (r"C:\Users\..."), double the backslashes ("C:\\Users\\..."), or use forward slashes in the path.
findfiles
import fnmatch  # fnmatch: Unix filename pattern matching
import os

images = ['*.', '*.py']
matches = []
for root, dirnames, filenames in os.walk("D:/Users/Lawht/Desktop"):
    for extensions in images:
        for filename in fnmatch.filter(filenames, extensions):
            matches.append(os.path.join(root, filename))
            print(filename)

for root, dirnames, filenames in os.walk("C:/Users/User/Desktop"):
    print(root)
    # print(dirnames)
    # print(filenames)

for root in os.walk("C:/Users/User/Desktop"):
    print(root)
walk()
walk() generates the file names in a directory tree
import os
for root, dirs, files in os.walk("."):
    for name in files:
        print(os.path.join(root, name))
    for name in dirs:
        print(os.path.join(root, name))
add two matrices
# Program to add two matrices using nested loop
X = [[12,7,3],[4,5,6],[7,8,9]]
Y = [[5,8,1],[6,7,3],[4,5,9]]
result = [[0,0,0],[0,0,0],[0,0,0]]
# iterate through rows
for i in range(len(X)):
    # iterate through columns
    for j in range(len(X[0])):
        result[i][j] = X[i][j] + Y[i][j]

for r in result:
    print(r)
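The same addition can also be written with zip() and a list comprehension (same X and Y as above):

```python
X = [[12, 7, 3], [4, 5, 6], [7, 8, 9]]
Y = [[5, 8, 1], [6, 7, 3], [4, 5, 9]]
# zip pairs up rows, then pairs up elements within each row
result = [[x + y for x, y in zip(xrow, yrow)] for xrow, yrow in zip(X, Y)]
print(result)  # → [[17, 15, 4], [10, 12, 9], [11, 13, 18]]
```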
Use Matrix library
use the numpy module, which has support for this.
import numpy as np
a = np.matrix([[1,2,3], [4,5,6], [7,8,9]])
b = np.matrix([[9,8,7], [6,5,4], [3,2,1]])
print(a+b)
# Emulate what the std lib does:
>>> import datetime
>>> today = datetime.date.today()
# Result of __str__ should be readable:
>>> str(today)
'2017-02-02'
# Result of __repr__ should be unambiguous:
>>> repr(today)
'datetime.date(2017, 2, 2)'
# Python interpreter sessions use
# __repr__ to inspect objects:
>>> today
datetime.date(2017, 2, 2)
# Use Python's built-in "dis"
# module to disassemble functions and
# inspect their CPython VM bytecode:
>>> def greet(name):
... return 'Hello, ' + name + '!'
>>> greet('Dan')
'Hello, Dan!'
>>> import dis
>>> dis.dis(greet)
2 0 LOAD_CONST 1 ('Hello, ')
2 LOAD_FAST 0 (name)
4 BINARY_ADD
6 LOAD_CONST 2 ('!')
8 BINARY_ADD
10 RETURN_VALUE
# @classmethod vs @staticmethod vs "plain" methods
# What's the difference?
class MyClass:
    def method(self):
        """
        Instance methods need a class instance and
        can access the instance through `self`.
        """
        return 'instance method called', self

    @classmethod
    def classmethod(cls):
        """
        Class methods don't need a class instance.
        They can't access the instance (self) but
        they have access to the class itself via `cls`.
        """
        return 'class method called', cls

    @staticmethod
    def staticmethod():
        """
        Static methods don't have access to `cls` or `self`.
        They work like regular functions but belong to
        the class's namespace.
        """
        return 'static method called'

# All method types can be
# called on a class instance:
>>> obj = MyClass()
>>> obj.method()
('instance method called', <MyClass instance at 0x...>)
>>> obj.classmethod()
('class method called', <class MyClass at 0x...>)
>>> obj.staticmethod()
'static method called'

# Calling instance methods fails
# if we only have the class object:
>>> MyClass.classmethod()
('class method called', <class MyClass at 0x...>)
>>> MyClass.staticmethod()
'static method called'
>>> MyClass.method()
TypeError:
    "unbound method method() must be called with MyClass "
    "instance as first argument (got nothing instead)"
# In Python 3.4+ you can use contextlib.suppress() to selectively ignore specific exceptions:
import contextlib
import os

with contextlib.suppress(FileNotFoundError):
    os.remove('somefile.tmp')

# This is equivalent to:
try:
    os.remove('somefile.tmp')
except FileNotFoundError:
    pass
# Pythonic ways of checking if all items in a list are equal:
>>> lst = ['a', 'a', 'a']
>>> len(set(lst)) == 1
True
>>> all(x == lst[0] for x in lst)
True
>>> lst.count(lst[0]) == len(lst)
True
# Python's `for` and `while` loops
# support an `else` clause that executes
# only if the loop terminates without
# hitting a `break` statement.
def contains(haystack, needle):
    """
    Throw a ValueError if `needle` not
    in `haystack`.
    """
    for item in haystack:
        if item == needle:
            break
    else:
        # The `else` here is a
        # "completion clause" that runs
        # only if the loop ran to completion
        # without hitting a `break` statement.
        raise ValueError('Needle not found')
>>> contains([23, 'needle', 0xbadc0ffee], 'needle')
None
>>> contains([23, 42, 0xbadc0ffee], 'needle')
ValueError: "Needle not found"
# A clearer alternative to the loop-`else` clause above is an
# early return, something like this:
def better_contains(haystack, needle):
    for item in haystack:
        if item == needle:
            return
    raise ValueError('Needle not found')

# Note: Typically you'd write something like this to do a
# membership test, which is much more Pythonic:
if needle not in haystack:
    raise ValueError('Needle not found')
# Virtual Environments ("virtualenvs") keep your project dependencies separated.
# Before creating & activating a virtualenv: `python` and `pip` map to the system version of the Python interpreter (e.g. Python 2.7)
$ which python
/usr/local/bin/python
# Let's create a fresh virtualenv using another version of Python (Python 3):
$ python3 -m venv ./venv
# A virtualenv is just a "Python environment in a folder":
$ ls ./venv
bin include lib pyvenv.cfg
# Activating a virtualenv configures the current shell session to use the python (and pip) commands from the virtualenv folder instead of the global environment:
$ source ./venv/bin/activate
# Note how activating a virtualenv modifies your shell prompt with a little note showing the name of the virtualenv folder:
(venv) $ echo "wee!"
# With an active virtualenv, the `python` command maps to the interpreter binary *inside the active virtualenv*:
(venv) $ which python
/Users/dan/my-project/venv/bin/python3
# Installing new libraries and frameworks with `pip` now installs them *into the virtualenv sandbox*, leaving your global environment (and any other virtualenvs) completely unmodified:
(venv) $ pip install requests
# To get back to the global Python environment, run the following command:
(venv) $ deactivate
# (See how the prompt changed back to "normal" again?)
$ echo "yay!"
# Deactivating the virtualenv flipped the `python` and `pip` commands back to the global environment:
$ which python
/usr/local/bin/python
# Python 3.3+ has a std lib module for displaying tracebacks even when Python "dies", e.g. with a segfault:
import faulthandler
faulthandler.enable()
# Can also be enabled with "python -X faulthandler" from the command line.
# Learn more here: https://docs.python.org/3/library/faulthandler.html
interacting with databases
SQLAlchemy Python Tutorial: interacting with databases
# pip install PyMysql
Python database workflow, illustrated
Analogy for Connection and Cursor
import sqlite3
conn = sqlite3.connect("EX.db")
cur = conn.cursor()

def table():
    cur.execute("CREATE TABLE exampl(rollno REAL, Name TEXT, age REAL)")

def value():
    cur.execute("INSERT INTO exampl VALUES(1, 'Albert', 23)")
    conn.commit()
    # conn.close()
    # cur.close()

def show():
    cur.execute("SELECT * FROM exampl")
    data = cur.fetchall()
    print(data)  # print(cur.fetchall())

table()
value()
show()
Setting up a Python Development Environment in Sublime Text
Setting up a Python Development Environment in Visual Studio Code
To install VS Code, search for "Visual Studio Code" (not "Visual Studio"); VS Code is free.
Activity bar on the left:
the activity bar views can also be opened from the command palette, Ctrl+Shift+P
Ctrl+Shift+E: Explorer
Ctrl+Shift+F: search and replace
Ctrl+Shift+G: Source Control (e.g. GitHub)
Ctrl+Shift+D: Debug
Ctrl+Shift+X: Extensions;
each recommendation comes with a reason,
search for "Sublime Text Keymap" to keep Sublime shortcuts,
popular extensions can be sorted by rating, name, or installs
Zen Mode: Ctrl+K Z. Double Esc exits Zen Mode.
python scripts:
import sys
print(sys.version)
print(sys.executable)
Right-click in the editor and select "Run Python File in Terminal".
Clicking the interpreter name on the bottom status bar lets you switch to a different interpreter version.
Typing cls in the terminal clears the screen.
Changing the interpreter creates a .vscode folder that stores the runtime environment.
Ctrl+Shift+P: VS Code command palette
Type "color theme" to select color themes;
type "file icon" to change file icons.
At the bottom of the activity bar, a gear icon called Manage can also open the command palette.
To select the default terminal, press F1 or Ctrl+Shift+P,
then type "Shell" and select "Select Default Shell".
Pressing F1 or Ctrl+Shift+P and typing "default settings" shows the defaults.
Ctrl+` opens the terminal; type "where python" there to show the path,
or start python and type "import sys; sys.executable" to show the path.
To exit python, type exit().
import sys
import requests
print(sys.version)
print(sys.executable)
print("hello")
r = requests.get("https://google.com")
print(r.status_code)
Python beep
import winsound
frequency = 1100 # Set Frequency To 1100 Hertz
duration = 1000 # Set Duration To 1000 ms == 1 second
winsound.Beep(frequency, duration)
The pass statement is a null operation;
nothing happens when it executes.
The pass statement is also useful as a placeholder where your code will eventually go
but has not been written yet (e.g., in function stubs):
Example
for letter in 'Python':
    if letter == 'h':
        pass
        print('This is pass block')
    print('Current Letter :', letter)
print("Good bye!")
Result:
Current Letter : P
Current Letter : y
Current Letter : t
This is pass block
Current Letter : h
Current Letter : o
Current Letter : n
Good bye!
use main() function to call functions
def main():
    data = read_input_file('data.csv')
    report = generate_report(data)
    write_report(report)

# Application entry point -> call main()
main()
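To make this pattern importable as well as runnable, the usual idiom is to guard the call with `if __name__ == '__main__':`. The helper functions below are stubbed out purely so the sketch runs on its own:

```python
# Stub helpers just for illustration; real code would do actual I/O.
def read_input_file(path):
    return ['row1', 'row2']

def generate_report(data):
    return 'report with %d rows' % len(data)

def write_report(report):
    print(report)

def main():
    data = read_input_file('data.csv')
    report = generate_report(data)
    write_report(report)

# Runs only when executed as a script, not when imported as a module.
if __name__ == '__main__':
    main()
```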
contextlib.suppress() function
Available since Python 3.4, contextlib.suppress() lets you selectively ignore specific exceptions using a context manager and the "with" statement:
import contextlib
import os

with contextlib.suppress(FileNotFoundError):
    os.remove('somefile.tmp')
This is equivalent to the following try/except clause:
try:
    os.remove('somefile.tmp')
except FileNotFoundError:
    pass
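A small runnable sketch (the filenames here are made up); note that suppress() also accepts several exception types at once:

```python
import contextlib
import os

# Removing a file that may not exist; FileNotFoundError is swallowed.
with contextlib.suppress(FileNotFoundError):
    os.remove('no-such-file.tmp')  # hypothetical temp file

# Any number of exception types can be suppressed together:
with contextlib.suppress(KeyError, IndexError):
    {}['missing']  # raises KeyError, silently ignored

print('program continues normally')
```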
Parallel computing in Python
(in 60 seconds or less)
parallel programming in Python
Parallel Processing in Python – A Practical Guide with Examples
parallel programming using Python's multiprocessing module
If your Python programs are slower than you'd like you can often speed them up by *parallelizing* them.
Basically, parallel computing allows you to carry out many calculations at the same time, thus reducing the amount of time it takes to run your program to completion.
I know, this sounds fairly vague and complicated somehow...but bear with me for the next 50 seconds or so.
Here's an end-to-end example of parallel computing in Python 2/3, using only tools built into the Python standard library—
Ready? Go!
First, we need to do some setup work. We'll import the "collections" and the "multiprocessing" module so we can use Python's parallel computing facilities and define the data structure we'll work with:
import collections
import multiprocessing
Second, we'll use "collections.namedtuple" to define a new (immutable) data type we can use to represent our data set, a collection of scientists:
Scientist = collections.namedtuple('Scientist', [
    'name',
    'born',
])
scientists = (
    Scientist(name='Ada Lovelace', born=1815),
    Scientist(name='Emmy Noether', born=1882),
    Scientist(name='Marie Curie', born=1867),
    Scientist(name='Tu Youyou', born=1930),
    Scientist(name='Ada Yonath', born=1939),
    Scientist(name='Vera Rubin', born=1928),
    Scientist(name='Sally Ride', born=1951),
)
Third, we'll write a "data processing function" that accepts a scientist object and returns a dictionary containing the scientist's name and their calculated age:
def process_item(item):
    return {
        'name': item.name,
        'age': 2017 - item.born
    }
The process_item() function just represents a simple data transformation to keep this example short and sweet—but you could swap it out with a much more complex computation no problem.
(20 seconds remaining)
Fourth, and this is where the real parallelization magic happens, we'll set up a "multiprocessing pool" that allows us to spread our calculations across all available CPU cores.
Then we call the pool's map() method to apply our process_item() function to all scientist objects, in parallel batches:
pool = multiprocessing.Pool()
result = pool.map(process_item, scientists)
Note how batching and distributing the work across multiple CPU cores, performing the work, and collecting the results are all handled by the multiprocessing pool. How great is that?
Fifth, we're all done here with 5 seconds remaining—
Let's print the results of our data transformation to the console so we can make sure the program did what it was supposed to:
print(tuple(result))
That's the end of our little program. And here's what you should expect to see printed out on your console:
({'name': 'Ada Lovelace', 'age': 202},
{'name': 'Emmy Noether', 'age': 135},
{'name': 'Marie Curie', 'age': 150},
{'name': 'Tu Youyou', 'age': 87},
{'name': 'Ada Yonath', 'age': 78},
{'name': 'Vera Rubin', 'age': 89},
{'name': 'Sally Ride', 'age': 66})
Isn't Python just lovely?
Now, obviously I took some shortcuts here and picked an example that made parallelization seem effortless—
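One of those shortcuts is worth naming: when Python uses the "spawn" start method (Windows, and macOS since Python 3.8), the pool must be created under an `if __name__ == '__main__':` guard, or the worker processes will re-import the module and recurse. A condensed sketch of the same program with that guard (shortened to two scientists):

```python
import collections
import multiprocessing

Scientist = collections.namedtuple('Scientist', ['name', 'born'])

scientists = (
    Scientist(name='Ada Lovelace', born=1815),
    Scientist(name='Emmy Noether', born=1882),
)

def process_item(item):
    return {'name': item.name, 'age': 2017 - item.born}

if __name__ == '__main__':
    # The guard keeps worker processes from re-running this block on import.
    with multiprocessing.Pool() as pool:
        result = pool.map(process_item, scientists)
    print(tuple(result))
```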
comprehensive data exploration with python
'The most difficult thing in life is to know yourself'
This quote belongs to Thales of Miletus.
Thales was a Greek/Phoenician philosopher, mathematician and astronomer, who is recognised as the first individual in Western civilisation known to have entertained and engaged in scientific thought.
I wouldn't say that knowing your data is the most difficult thing in data science, but it is time-consuming.
Therefore, it's easy to overlook this initial step and jump too soon into the water.
So I tried to learn how to swim before jumping into the water.
Based on Hair et al. (2013), chapter 'Examining your data', I did my best to follow a comprehensive, but not exhaustive, analysis of the data.
I'm far from reporting a rigorous study in this kernel, but I hope that it can be useful for the community, so I'm sharing how I applied some of those data analysis principles to this problem.
Despite the strange names I gave to the chapters, what we are doing in this kernel is something like:
Understand the problem
We'll look at each variable and do a philosophical analysis about their meaning and importance for this problem.
Univariable study
We'll just focus on the dependent variable ('SalePrice') and try to know a little bit more about it.
Multivariate study
We'll try to understand how the dependent variable and independent variables relate.
Basic cleaning
We'll clean the dataset and handle the missing data, outliers and categorical variables.
Test assumptions
We'll check if our data meets the assumptions required by most multivariate techniques.
Now, it's time to have fun!
#invite people for the Kaggle party
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
#bring in the six packs
df_train = pd.read_csv('../input/train.csv')
#check the decoration
df_train.columns
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
'SaleCondition', 'SalePrice'],
dtype='object')
1. So... What can we expect?
In order to understand our data, we can look at each variable and try to understand their meaning and relevance to this problem.
I know this is time-consuming, but it will give us the flavour of our dataset.
In order to have some discipline in our analysis, we can create an Excel spreadsheet with the following columns:
Variable - Variable name.
Type - Identification of the variables' type.
There are two possible values for this field: 'numerical' or 'categorical'.
By 'numerical' we mean variables for which the values are numbers, and by 'categorical' we mean variables for which the values are categories.
Segment - Identification of the variables' segment.
We can define three possible segments: building, space or location.
When we say 'building', we mean a variable that relates to the physical characteristics of the building (e.g. 'OverallQual').
When we say 'space', we mean a variable that reports space properties of the house (e.g. 'TotalBsmtSF').
Finally, when we say a 'location', we mean a variable that gives information about the place where the house is located (e.g. 'Neighborhood').
Expectation - Our expectation about the variable influence in 'SalePrice'.
We can use a categorical scale with 'High', 'Medium' and 'Low' as possible values.
Conclusion - Our conclusions about the importance of the variable, after we give a quick look at the data.
We can keep with the same categorical scale as in 'Expectation'.
Comments - Any general comments that occurred to us.
While 'Type' and 'Segment' are just for possible future reference, the column 'Expectation' is important because it will help us develop a 'sixth sense'.
To fill this column, we should read the description of all the variables and, one by one, ask ourselves:
Do we think about this variable when we are buying a house? (e.g. When we think about the house of our dreams, do we care about its 'Masonry veneer type'?).
If so, how important would this variable be? (e.g. What is the impact of having 'Excellent' material on the exterior instead of 'Poor'? And of having 'Excellent' instead of 'Good'?).
Is this information already described in any other variable? (e.g. If 'LandContour' gives the flatness of the property, do we really need to know the 'LandSlope'?).
After this daunting exercise, we can filter the spreadsheet and look carefully at the variables with 'High' 'Expectation'.
Then, we can rush into some scatter plots between those variables and 'SalePrice', filling in the 'Conclusion' column, which is just the correction of our expectations.
I went through this process and concluded that the following variables can play an important role in this problem:
OverallQual (which is a variable that I don't like because I don't know how it was computed; a funny exercise would be to predict 'OverallQual' using all the other variables available).
YearBuilt.
TotalBsmtSF.
GrLivArea.
I ended up with two 'building' variables ('OverallQual' and 'YearBuilt') and two 'space' variables ('TotalBsmtSF' and 'GrLivArea').
This might be a little bit unexpected as it goes against the real estate mantra that all that matters is 'location, location and location'.
It is possible that this quick data examination process was a bit harsh for categorical variables.
For example, I expected the 'Neighborhood' variable to be more relevant, but after the data examination I ended up excluding it.
Maybe this is related to the use of scatter plots instead of boxplots, which are more suitable for categorical variables visualization.
The way we visualize data often influences our conclusions.
However, the main point of this exercise was to think a little about our data and expectations, so I think we achieved our goal.
Now it's time for 'a little less conversation, a little more action please'.
Let's
shake it!
2. First things first: analysing 'SalePrice'
'SalePrice' is the reason of our quest.
It's like when we're going to a party.
We always have a reason to be there.
Usually, women are that reason. (disclaimer: adapt it to men, dancing or alcohol, according to your preferences)
Using the women analogy, let's build a little story, the story of 'How we met 'SalePrice''.
Everything started in our Kaggle party, when we were looking for a dance partner.
After a while searching in the dance floor, we saw a girl, near the bar, using dance shoes.
That's a sign that she's there to dance.
We spend much time doing predictive modelling and participating in analytics competitions, so talking with girls is not one of our super powers.
Even so, we gave it a try:
'Hi, I'm Kaggly! And you? 'SalePrice'? What a beautiful name! You know 'SalePrice', could you give me some data about you? I just developed a model to calculate the probability of a successful relationship between two people.
I'd like to apply it to us!'
#descriptive statistics summary
df_train['SalePrice'].describe()
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
'Very well...
It seems that your minimum price is larger than zero.
Excellent! You don't have one of those personal traits that would destroy my model! Do you have any picture that you can send me? I don't know...
like, you in the beach...
or maybe a selfie in the gym?'
#histogram
sns.distplot(df_train['SalePrice']);
'Ah! I see that you use seaborn makeup when you're going out...
That's so elegant! I also see that you:
Deviate from the normal distribution.
Have appreciable positive skewness.
Show peakedness.
This is getting interesting! 'SalePrice', could you give me your body measures?'
#skewness and kurtosis
print(( "Skewness: %f " % df_train [ 'SalePrice' ].skew ()))
print(( "Kurtosis: %f " % df_train [ 'SalePrice' ].kurt ()))
Skewness: 1.882876
Kurtosis: 6.536282
'Amazing! If my love calculator is correct, our success probability is 97.834657%.
I think we should meet again! Please, keep my number and give me a call if you're free next Friday.
See you in a while, crocodile!'
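A technical aside on that skewness number: positive skew like this is often tamed with a log transform, which pulls the distribution back toward normal. A sketch with synthetic log-normal data (made-up draws, not the kernel's actual prices):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
# Log-normal draws are positively skewed, much like house prices.
prices = pd.Series(np.exp(rng.normal(12, 0.4, size=1460)))

print('skewness before log: %.2f' % prices.skew())
print('skewness after log : %.2f' % np.log(prices).skew())
```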
'SalePrice', her buddies and her interests
It is military wisdom to choose the terrain where you will fight.
As soon as 'SalePrice' walked away, we went to Facebook.
Yes, now this is getting serious.
Notice that this is not stalking.
It's just an intense research of an individual, if you know what I mean.
According to her profile, we have some common friends.
Besides Chuck Norris, we both know 'GrLivArea' and 'TotalBsmtSF'.
Moreover, we also have common interests such as 'OverallQual' and 'YearBuilt'.
This looks promising!
To take the most out of our research, we will start by looking carefully at the profiles of our common friends and later we will focus on our common interests.
Relationship with numerical variables
#scatter plot grlivarea/saleprice
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000));
Hmmm...
It seems that 'SalePrice' and 'GrLivArea' are really old friends, with a
linear relationship.
And what about 'TotalBsmtSF'?
#scatter plot totalbsmtsf/saleprice
var = 'TotalBsmtSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000));
'TotalBsmtSF' is also a great friend of 'SalePrice' but this seems a much more emotional relationship! Everything is ok and suddenly, in a
strong linear (exponential?) reaction, everything changes.
Moreover, it's clear that sometimes 'TotalBsmtSF' closes in itself and gives zero credit to 'SalePrice'.
Relationship with categorical features
#box plot overallqual/saleprice
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
Like all the pretty girls, 'SalePrice' enjoys 'OverallQual'.
Note to self: consider whether McDonald's is suitable for the first date.
var = 'YearBuilt'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);
Although it's not a strong tendency, I'd say that 'SalePrice' is more prone to spend more money on new stuff than on old relics.
Note
: we don't know if 'SalePrice' is in constant prices.
Constant prices try to remove the effect of inflation.
If 'SalePrice' is not in constant prices, it should be, so that prices are comparable over the years.
In summary
Stories aside, we can conclude that:
'GrLivArea' and 'TotalBsmtSF' seem to be linearly related with 'SalePrice'.
Both relationships are positive, which means that as one variable increases, the other also increases.
In the case of 'TotalBsmtSF', we can see that the slope of the linear relationship is particularly high.
'OverallQual' and 'YearBuilt' also seem to be related with 'SalePrice'.
The relationship seems to be stronger in the case of 'OverallQual', where the box plot shows how sales prices increase with the overall quality.
We just analysed four variables, but there are many other that we should analyse.
The trick here seems to be the choice of the right features (feature selection) and not the definition of complex relationships between them (feature engineering).
That said, let's separate the wheat from the chaff.
3. Keep calm and work smart
Until now we just followed our intuition and analysed the variables we thought were important.
In spite of our efforts to give an objective character to our analysis, we must say that our starting point was subjective.
As an engineer, I don't feel comfortable with this approach.
All my education was about developing a disciplined mind, able to withstand the winds of subjectivity.
There's a reason for that.
Try to be subjective in structural engineering and you will see physics making things fall down.
It can hurt.
So, let's overcome inertia and do a more objective analysis.
The 'plasma soup'
'In the very beginning there was nothing except for a plasma soup.
What is known of these brief moments in time, at the start of our study of cosmology, is largely conjectural.
However, science has devised some sketch of what probably happened, based on what is known about the universe today.' (source:
http://umich.edu/~gs265/bigbang.htm
)
To explore the universe, we will start with some practical recipes to make sense of our 'plasma soup':
Correlation matrix (heatmap style).
'SalePrice' correlation matrix (zoomed heatmap style).
Scatter plots between the most correlated variables (move like Jagger style).
Correlation matrix (heatmap style)
#correlation matrix
corrmat = df_train.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);
In my opinion, this heatmap is the best way to get a quick overview of our 'plasma soup' and its relationships. (Thank you @seaborn!)
At first sight, there are two red colored squares that get my attention.
The first one refers to the 'TotalBsmtSF' and '1stFlrSF' variables, and the second one refers to the 'GarageX' variables.
Both cases show how significant the correlation is between these variables.
Actually, this correlation is so strong that it can indicate a situation of multicollinearity.
If we think about these variables, we can conclude that they give almost the same information so multicollinearity really occurs.
Heatmaps are great to detect this kind of situations and in problems dominated by feature selection, like ours, they are an essential tool.
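The eyeballing above can also be automated: compute the absolute correlation matrix and flag every pair of columns above some threshold. A sketch on synthetic data (the 0.8 cutoff and the fake garage columns are my own choices for illustration, not the kernel's):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
area = rng.uniform(200, 900, size=100)
df = pd.DataFrame({
    'GarageArea': area,
    'GarageCars': np.round(area / 250),         # nearly a function of area
    'YearBuilt': rng.uniform(1900, 2010, 100),  # unrelated noise
})

corr = df.corr().abs()
# Every distinct pair of columns with |correlation| above 0.8:
pairs = [(a, b) for a in corr.columns for b in corr.columns
         if a < b and corr.loc[a, b] > 0.8]
print(pairs)
```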
Another thing that got my attention was the 'SalePrice' correlations.
We can see our well-known 'GrLivArea', 'TotalBsmtSF', and 'OverallQual' saying a big 'Hi!', but we can also see many other variables that should be taken into account.
That's what we will do next.
#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
According to our crystal ball, these are the variables most correlated with 'SalePrice'.
My thoughts on this:
'OverallQual', 'GrLivArea' and 'TotalBsmtSF' are strongly correlated with 'SalePrice'.
Check!
'GarageCars' and 'GarageArea' are also some of the most strongly correlated variables.
However, as we discussed in the last sub-point, the number of cars that fit into the garage is a consequence of the garage area.
'GarageCars' and 'GarageArea' are like twin brothers.
You'll never be able to distinguish them.
Therefore, we just need one of these variables in our analysis (we can keep 'GarageCars' since its correlation with 'SalePrice' is higher).
'TotalBsmtSF' and '1stFlrSF' also seem to be twin brothers.
We can keep 'TotalBsmtSF' just to say that our first guess was right (re-read 'So... What can we expect?').
'FullBath'?? Really?
'TotRmsAbvGrd' and 'GrLivArea', twin brothers again.
Is this dataset from Chernobyl?
Ah...
'YearBuilt'...
It seems that 'YearBuilt' is slightly correlated with 'SalePrice'.
Honestly, it scares me to think about 'YearBuilt' because I start feeling that we should do a little bit of time-series analysis to get this right.
I'll leave this as a homework for you.
Let's proceed to the scatter plots.
Scatter plots between 'SalePrice' and correlated variables (move like Jagger style)
Get ready for what you're about to see.
I must confess that the first time I saw these scatter plots I was totally blown away! So much information in so short space...
It's just amazing.
Once more, thank you @seaborn! You make me 'move like Jagger'!
#scatterplot
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], size=2.5)
plt.show();
Although we already know some of the main figures, this mega scatter plot gives us a reasonable idea about variables relationships.
One of the figures we may find interesting is the one between 'TotalBsmtSF' and 'GrLivArea'.
In this figure we can see the dots drawing a straight line, which almost acts like a border.
It totally makes sense that the majority of the dots stay below that line.
Basement areas can be equal to the above ground living area, but a basement area bigger than the above ground living area is not to be expected (unless you're trying to buy a bunker).
The plot concerning 'SalePrice' and 'YearBuilt' can also make us think.
In the bottom of the 'dots cloud', we see what almost appears to be a shy exponential function (be creative).
We can also see this same tendency in the upper limit of the 'dots cloud' (be even more creative).
Also, notice how the set of dots regarding the last years tend to stay above this limit (I just wanted to say that prices are increasing faster now).
Ok, enough of Rorschach test for now.
Let's move forward to what's missing: missing data!
4. Missing data
Important questions when thinking about missing data:
How prevalent is the missing data?
Is missing data random or does it have a pattern?
The answer to these questions is important for practical reasons because missing data can imply a reduction of the sample size.
This can prevent us from proceeding with the analysis.
Moreover, from a substantive perspective, we need to ensure that the missing data process is not biased and hiding an inconvenient truth.
#missing data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum() / df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
              Total   Percent
PoolQC         1453  0.995205
MiscFeature    1406  0.963014
Alley          1369  0.937671
Fence          1179  0.807534
FireplaceQu     690  0.472603
LotFrontage     259  0.177397
GarageCond       81  0.055479
GarageType       81  0.055479
GarageYrBlt      81  0.055479
GarageFinish     81  0.055479
GarageQual       81  0.055479
BsmtExposure     38  0.026027
BsmtFinType2     38  0.026027
BsmtFinType1     37  0.025342
BsmtCond         37  0.025342
BsmtQual         37  0.025342
MasVnrArea        8  0.005479
MasVnrType        8  0.005479
Electrical        1  0.000685
Utilities         0  0.000000
Let's analyse this to understand how to handle the missing data.
We'll consider that when more than 15% of the data is missing, we should delete the corresponding variable and pretend it never existed.
This means that we will not try any trick to fill the missing data in these cases.
According to this, there is a set of variables (e.g. 'PoolQC', 'MiscFeature', 'Alley', etc.) that we should delete.
The point is: will we miss this data? I don't think so.
None of these variables seem to be very important, since most of them are not aspects we think about when buying a house (maybe that's the reason why the data is missing?).
Moreover, looking closer at the variables, we could say that variables like 'PoolQC', 'MiscFeature' and 'FireplaceQu' are strong candidates for outliers, so we'll be happy to delete them.
In what concerns the remaining cases, we can see that the 'GarageX' variables have the same number of missing data points.
I bet the missing data refers to the same set of observations (although I will not check it; it's just 5%, and we should not spend 20% of our effort on a 5% problem).
Since the most important information regarding garages is expressed by 'GarageCars', and considering that we are just talking about 5% of missing data, I'll delete the mentioned 'GarageX' variables.
The same logic applies to the 'BsmtX' variables.
Regarding 'MasVnrArea' and 'MasVnrType', we can consider that these variables are not essential.
Furthermore, they have a strong correlation with 'YearBuilt' and 'OverallQual' which are already considered.
Thus, we will not lose information if we delete 'MasVnrArea' and 'MasVnrType'.
Finally, we have one missing observation in 'Electrical'.
Since it is just one observation, we'll delete this observation and keep the variable.
In summary, to handle missing data, we'll delete all the variables with missing data, except the variable 'Electrical'.
In 'Electrical' we'll just delete the observation with missing data.
#dealing with missing data
df_train = df_train.drop((missing_data[missing_data['Total'] > 1]).index, 1)
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
df_train.isnull().sum().max() #just checking that there's no missing data missing...
0
Out liars!
Outliers are also something that we should be aware of.
Why? Because outliers can markedly affect our models and can be a valuable source of information, providing us insights about specific behaviours.
Outliers are a complex subject that deserves more attention.
Here, we'll just do a quick analysis through the standard deviation of 'SalePrice' and a set of scatter plots.
Univariate analysis
The primary concern here is to establish a threshold that defines an observation as an outlier.
To do so, we'll standardize the data.
In this context, data standardization means converting data values to have mean of 0 and a standard deviation of 1.
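For intuition, the z-score computation that standardization performs can be written by hand with NumPy; subtracting the mean and dividing by the standard deviation yields exactly a mean of 0 and a standard deviation of 1 (the five prices here are made up):

```python
import numpy as np

prices = np.array([34900.0, 129975.0, 163000.0, 214000.0, 755000.0])

# Standardize: subtract the mean, divide by the standard deviation.
z = (prices - prices.mean()) / prices.std()

print(np.isclose(z.mean(), 0.0))  # True
print(np.isclose(z.std(), 1.0))   # True
```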
#standardizing data
saleprice_scaled = StandardScaler().fit_transform(df_train['SalePrice'][:, np.newaxis])
low_range = saleprice_scaled[saleprice_scaled[:, 0].argsort()][:10]
high_range = saleprice_scaled[saleprice_scaled[:, 0].argsort()][-10:]
print('outer range (low) of the distribution:')
print(low_range)
print('\nouter range (high) of the distribution:')
print(high_range)
outer range (low) of the distribution:
[[-1.83820775]
[-1.83303414]
[-1.80044422]
[-1.78282123]
[-1.77400974]
[-1.62295562]
[-1.6166617 ]
[-1.58519209]
[-1.58519209]
[-1.57269236]]
outer range (high) of the distribution:
[[3.82758058]
[4.0395221 ]
[4.49473628]
[4.70872962]
[4.728631 ]
[5.06034585]
[5.42191907]
[5.58987866]
[7.10041987]
[7.22629831]]
How 'SalePrice' looks with her new clothes:
Low range values are similar and not too far from 0.
High range values are far from 0 and the 7.something values are really out of range.
For now, we'll not consider any of these values as an outlier but we should be careful with those two 7.something values.
Bivariate analysis
We already know the following scatter plots by heart.
However, when we look to things from a new perspective, there's always something to discover.
As Alan Kay said, 'a change in perspective is worth 80 IQ points'.
#bivariate analysis saleprice/grlivarea
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000));
What has been revealed:
The two values with bigger 'GrLivArea' seem strange and they are not following the crowd.
We can speculate why this is happening.
Maybe they refer to agricultural area and that could explain the low price.
I'm not sure about this but I'm quite confident that these two points are not representative of the typical case.
Therefore, we'll define them as outliers and delete them.
The two observations in the top of the plot are those 7.something observations that we said we should be careful about.
They look like two special cases, however they seem to be following the trend.
For that reason, we will keep them.
#deleting points
df_train.sort_values(by='GrLivArea', ascending=False)[:2]
df_train = df_train.drop(df_train[df_train['Id'] == 1299].index)
df_train = df_train.drop(df_train[df_train['Id'] == 524].index)
#bivariate analysis saleprice/totalbsmtsf
var = 'TotalBsmtSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000));
We can feel tempted to eliminate some observations (e.g. TotalBsmtSF > 3000) but I suppose it's not worth it.
We can live with that, so we'll not do anything.
5. Getting hard core
In Ayn Rand's novel, 'Atlas Shrugged', there is an often-repeated question: who is John Galt? A big part of the book is about the quest to discover the answer to this question.
I feel Randian now.
Who is 'SalePrice'?
The answer to this question lies in testing for the assumptions underlying the statistical bases for multivariate analysis.
We already did some data cleaning and discovered a lot about 'SalePrice'.
Now it's time to go deep and understand how 'SalePrice' complies with the statistical assumptions that enables us to apply multivariate techniques.
According to Hair et al. (2013), four assumptions should be tested:
Normality - When we talk about normality what we mean is that the data should look like a normal distribution.
This is important because several statistic tests rely on this (e.g. t-statistics).
In this exercise we'll just check univariate normality for 'SalePrice' (which is a limited approach).
Remember that univariate normality doesn't ensure multivariate normality (which is what we would like to have), but it helps.
Another detail to take into account is that in big samples (>200 observations) normality is not such an issue.
However, if we solve normality, we avoid a lot of other problems (e.g. heteroscedasticity), so that's the main reason why we are doing this analysis.
Homoscedasticity - I just hope I wrote it right.
Homoscedasticity refers to the 'assumption that dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s)'
(Hair et al., 2013)
Homoscedasticity is desirable because we want the error term to be the same across all values of the independent variables.
Linearity
- The most common way to assess linearity is to examine scatter plots and search for linear patterns.
If patterns are not linear, it would be worthwhile to explore data transformations.
However, we'll not get into this because most of the scatter plots we've seen appear to have linear relationships.
Absence of correlated errors - Correlated errors, like the definition suggests, happen when one error is correlated to another.
For instance, if one positive error makes a negative error systematically, it means that there's a relationship between these variables.
This occurs often in time series, where some patterns are time related.
We'll also not get into this.
However, if you detect something, try to add a variable that can explain the effect you're getting.
That's the most common solution for correlated errors.
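Before moving to the plots, the normality question can be probed numerically. This is a minimal sketch on synthetic log-normal data, not the kernel's df_train, and the skewness helper is hand-rolled for transparency (scipy.stats.skew would do the same job):

```python
import numpy as np

rng = np.random.default_rng(0)
# Log-normal data has a long right tail, much like house prices
sample = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)

def skewness(x):
    # Third standardized moment: positive for a right-skewed distribution
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

print(skewness(sample) > 0)  # True — the long right tail gives positive skewness
```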
What do you think Elvis would say about this long explanation? 'A little less conversation, a little more action please'? Probably...
By the way, do you know what was Elvis's last great hit?
(...)
The bathroom floor.
In the search for normality
The point here is to test 'SalePrice' in a very lean way.
We'll do this paying attention to:
Histogram - Kurtosis and skewness.
Normal probability plot - Data distribution should closely follow the diagonal that represents the normal distribution.
#histogram and normal probability plot
sns.distplot(df_train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
Ok, 'SalePrice' is not normal.
It shows 'peakedness', positive skewness and does not follow the diagonal line.
But everything's not lost.
A simple data transformation can solve the problem.
This is one of the awesome things you can learn in statistical books: in case of positive skewness, log transformations usually work well.
When I discovered this, I felt like a Hogwarts student discovering a new cool spell.
Avada kedavra!
#applying log transformation
df_train['SalePrice'] = np.log(df_train['SalePrice'])
#transformed histogram and normal probability plot
sns.distplot(df_train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
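The claim that a log transformation tames positive skewness can be verified numerically. The prices array below is an invented stand-in for 'SalePrice', and the skewness helper is hand-rolled:

```python
import numpy as np

rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12, sigma=0.4, size=5000)  # synthetic 'SalePrice'

def skewness(x):
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

before = skewness(prices)          # clearly positive for log-normal data
after = skewness(np.log(prices))   # near zero: the log of a log-normal is normal
print(abs(after) < abs(before))    # True — the transform pulls toward symmetry
```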
Done! Let's check what's going on with 'GrLivArea'.
#histogram and normal probability plot
sns.distplot(df_train['GrLivArea'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['GrLivArea'], plot=plt)
Tastes like skewness...
Avada kedavra!
#data transformation
df_train['GrLivArea'] = np.log(df_train['GrLivArea'])
#transformed histogram and normal probability plot
sns.distplot(df_train['GrLivArea'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['GrLivArea'], plot=plt)
Next, please...
#histogram and normal probability plot
sns.distplot(df_train['TotalBsmtSF'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['TotalBsmtSF'], plot=plt)
Ok, now we are dealing with the big boss.
What do we have here?
Something that, in general, presents skewness.
A significant number of observations with value zero (houses without basement).
A big problem because the value zero doesn't allow us to do log transformations.
To apply a log transformation here, we'll create a variable that can get the effect of having or not having basement (binary variable).
Then, we'll do a log transformation to all the non-zero observations, ignoring those with value zero.
This way we can transform data, without losing the effect of having or not basement.
I'm not sure if this approach is correct.
It just seemed right to me.
That's what I call 'high risk engineering'.
#create column for new variable (one is enough because it's a binary categorical feature)
#if area>0 it gets 1, for area==0 it gets 0
df_train['HasBsmt'] = pd.Series(len(df_train['TotalBsmtSF']), index=df_train.index)
df_train['HasBsmt'] = 0
df_train.loc[df_train['TotalBsmtSF'] > 0, 'HasBsmt'] = 1
#transform data
df_train.loc[df_train['HasBsmt'] == 1, 'TotalBsmtSF'] = np.log(df_train['TotalBsmtSF'])
#histogram and normal probability plot
sns.distplot(df_train[df_train['TotalBsmtSF'] > 0]['TotalBsmtSF'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train[df_train['TotalBsmtSF'] > 0]['TotalBsmtSF'], plot=plt)
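The flag-then-transform move can be sketched on a toy frame (column names copied from the kernel, data invented). Masking with .loc keeps the zero-area rows untouched:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train: some houses have no basement (area 0)
df = pd.DataFrame({'TotalBsmtSF': [0.0, 800.0, 0.0, 1200.0]})

# Binary flag, then log-transform only the positive areas
mask = df['TotalBsmtSF'] > 0
df['HasBsmt'] = mask.astype(int)
df.loc[mask, 'TotalBsmtSF'] = np.log(df.loc[mask, 'TotalBsmtSF'])

print(df['HasBsmt'].tolist())      # [0, 1, 0, 1]
print(df['TotalBsmtSF'].iloc[0])   # 0.0 — the zero rows are left alone
```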
In the search for writing 'homoscedasticity' right at the first attempt
The best way to test homoscedasticity for two metric variables is graphical.
Departures from an equal dispersion are shown by such shapes as cones (small dispersion at one side of the graph, large dispersion at the opposite side) or diamonds (a large number of points at the center of the distribution).
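Besides eyeballing cones and diamonds, a rough numeric check is to bin the predictor and compare the response's spread per bin. This is only a sketch on data that is homoscedastic by construction, not a formal test (something like Breusch-Pagan would be the formal route):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 5000)
y = 2 * x + rng.normal(0, 1, 5000)  # constant error variance by construction

# Residual variance within each of 5 equal-width x bins
edges = np.linspace(0, 10, 6)[1:-1]
bins = np.digitize(x, edges)
variances = [np.var(y[bins == b] - 2 * x[bins == b]) for b in range(5)]

print(max(variances) / min(variances))  # close to 1 when the spread is equal
```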
Starting with 'SalePrice' and 'GrLivArea'...
#scatter plot
plt.scatter(df_train['GrLivArea'], df_train['SalePrice']);
Older versions of this scatter plot (prior to the log transformations) had a conic shape (go back and check 'Scatter plots between 'SalePrice' and correlated variables (move like Jagger style)').
As you can see, the current scatter plot doesn't have a conic shape anymore.
That's the power of normality! Just by ensuring normality in some variables, we solved the homoscedasticity problem.
Now let's check 'SalePrice' with 'TotalBsmtSF'.
#scatter plot
plt.scatter(df_train[df_train['TotalBsmtSF'] > 0]['TotalBsmtSF'], df_train[df_train['TotalBsmtSF'] > 0]['SalePrice']);
We can say that, in general, 'SalePrice' exhibits equal levels of variance across the range of 'TotalBsmtSF'.
Cool!
That's it! We reached the end of our exercise.
Throughout this kernel we put in practice many of the strategies proposed by
Hair et al. (2013).
We philosophized about the variables, we analysed 'SalePrice' alone and with the most correlated variables, we dealt with missing data and outliers, we tested some of the fundamental statistical assumptions and we even transformed categorical variables into dummy variables.
That's a lot of work that Python helped us make easier.
But the quest is not over.
Remember that our story stopped in the Facebook research.
Now it's time to give a call to 'SalePrice' and invite her to dinner.
Try to predict her behaviour.
Do you think she's a girl that enjoys regularized linear regression approaches? Or do you think she prefers ensemble methods? Or maybe something else?
It's up to you to find out.
To get the docs on all the functions at once, interactively:
print(dir(os))  # show all function names
for i in dir(os): print(i)  # list them out one by one
The inspect module.
Also see the pydoc module, the help() function in the interactive interpreter and the pydoc command-line tool which generates the documentation you are after.
help(os)
An elegant Python way to read the lines of a file into a list
In most cases, to read the lines of a file into a list:
with open(fileName) as f:
    lineList = f.readlines()
In this case, every element in the list contains a \n at the end of the string, which can be extremely annoying in some cases.
The same problem occurs if you use:
lineList = list()
with open(fileName) as f:
    for line in f:
        lineList.append(line)
To overcome this, use:
lineList = [line.rstrip('\n') for line in open(fileName)]
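One nit on the one-liner above: the handle returned by open(fileName) is never explicitly closed. A variant that keeps both the stripping and the context manager (a temp file is created first so the sketch is self-contained):

```python
import os
import tempfile

# Create a small file to read back, just for the demo
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    fileName = tmp.name

# The comprehension runs inside the with-block, so the file is closed properly
with open(fileName) as f:
    lineList = [line.rstrip('\n') for line in f]

os.remove(fileName)
print(lineList)  # ['first', 'second', 'third']
```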
Call a function from another file in Python
If you have a file a.py and inside you have some functions:
def b():
    # Something
    return 1

def c():
    # Something
    return 2
And you want to import them in z.py you have to write
from a import b, c
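To see the import work end to end without creating files by hand, this sketch writes a.py into a temp directory, puts that directory on sys.path, and imports from it (the temp-dir shuffle is only for self-containment; normally z.py just sits next to a.py):

```python
import os
import sys
import tempfile

# Write a.py with the two functions from the example
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "a.py"), "w") as f:
    f.write("def b():\n    return 1\n\ndef c():\n    return 2\n")

sys.path.insert(0, tmpdir)  # make a.py importable, as if we were in z.py
from a import b, c

print(b() + c())  # 3
```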
Plain HTTP libraries in high-level programming languages don't invoke a browser instance; they only request and receive raw HTML.
So if we want to access the browser's local storage when scraping a page, we need to invoke an actual browser instance and leverage its JavaScript interpreter to read the local storage.
Selenium is the best solution for this.
A possible alternative is PhantomJS, a headless browser (often driven through Selenium as well).
JavaScript to iterate over the browser's localStorage object
for (var i = 0; i < localStorage.length; i++){
    key = localStorage.key(i);
    console.log(key + ': ' + localStorage.getItem(key));
}
Advanced script
As mentioned here, an HTML5-capable browser should also implement Array.prototype.map.
So the script can be:
Array.apply(0, new Array(localStorage.length)).map(function (o, i){
    return localStorage.key(i) + ':' + localStorage.getItem(localStorage.key(i));
})
Python with Selenium script for setting up and scraping local storage
from selenium import webdriver
driver = webdriver.Firefox()
url='http://www.w3schools.com/'
driver.get(url)
scriptArray="""localStorage.setItem("key1", 'new item');
localStorage.setItem("key2", 'second item');
return Array.apply(0, new Array(localStorage.length)).map(function (o, i) {
return localStorage.getItem(localStorage.key(i)); })"""
result = driver.execute_script(scriptArray)
print(result)
Finally, set up the environment: drag the ChromeDriver executable, chromedriver.exe, into Python's Scripts directory.
Note: you can also skip this step and instead pass the absolute path of chromedriver.exe when creating the driver.
Automatic installation
Automatic installation relies on the third-party library webdriver_manager: install it first, then call the corresponding method.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
browser = webdriver.Chrome(ChromeDriverManager().install())
browser.get('http://www.baidu.com')
search = browser.find_element_by_id('kw')
search.send_keys('python')
search.send_keys(Keys.ENTER)
# close the browser
browser.close()
In the code above, the ChromeDriverManager().install() call performs the automatic driver installation: it detects the current browser version and downloads the matching driver locally.
====== WebDriver manager ======
Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
There is no [win32] chromedriver for browser in cache
Trying to download new driver from https://chromedriver.storage.googleapis.com/96.0.4664.45/chromedriver_win32.zip
Driver has been saved in cache [C:\Users\Gdc\.wdm\drivers\chromedriver\win32\96.0.4664.45]
If the driver is already present locally, you'll instead see a message that it was found in the cache.
====== WebDriver manager ======
Current google-chrome version is 96.0.4664
Get LATEST driver version for 96.0.4664
Driver [C:\Users\Gdc\.wdm\drivers\chromedriver\win32\96.0.4664.45\chromedriver.exe] found in cache
With that preparation out of the way, we can start on the main content.
1. Basic usage
In this section we'll cover the basic operations: initializing a browser object, visiting pages, setting the browser window size, refreshing the page, and navigating back and forward.
Some might argue that Selenium is inefficient if all you need is local storage extraction.
If Selenium feels too heavyweight, you could try a Python binding for a desktop development framework instead, e.g. PyQt.
execute_script
Python doesn't provide a way to directly read/write the local storage, but it can be done with execute_script.
driver.execute_script("window.localStorage;")
or:
from selenium import webdriver
wd = webdriver.Firefox()
wd.get("http://localhost/foo/bar")
wd.execute_script("return localStorage.getItem('foo')")
or:
driver.execute_script("window.localStorage.setItem('key','value');");
driver.execute_script("window.localStorage.getItem('key');");
or define class:
class LocalStorage:

    def __init__(self, driver):
        self.driver = driver

    def __len__(self):
        return self.driver.execute_script("return window.localStorage.length;")

    def items(self):
        return self.driver.execute_script(
            "var ls = window.localStorage, items = {}; "
            "for (var i = 0, k; i < ls.length; ++i) "
            "  items[k = ls.key(i)] = ls.getItem(k); "
            "return items; ")

    def keys(self):
        return self.driver.execute_script(
            "var ls = window.localStorage, keys = []; "
            "for (var i = 0; i < ls.length; ++i) "
            "  keys[i] = ls.key(i); "
            "return keys; ")

    def get(self, key):
        return self.driver.execute_script("return window.localStorage.getItem(arguments[0]);", key)

    def set(self, key, value):
        self.driver.execute_script("window.localStorage.setItem(arguments[0], arguments[1]);", key, value)

    def has(self, key):
        return key in self.keys()

    def remove(self, key):
        self.driver.execute_script("window.localStorage.removeItem(arguments[0]);", key)

    def clear(self):
        self.driver.execute_script("window.localStorage.clear();")

    def __getitem__(self, key):
        value = self.get(key)
        if value is None:
            raise KeyError(key)
        return value

    def __setitem__(self, key, value):
        self.set(key, value)

    def __contains__(self, key):
        return key in self.keys()

    def __iter__(self):
        return self.items().__iter__()

    def __repr__(self):
        return self.items().__str__()
Usage example:
# get the local storage
storage = LocalStorage(driver)
# set an item
storage["mykey"] = 1234
storage.set("mykey2", 5678)
# get an item
print(storage["mykey"]) # raises a KeyError if the key is missing
print(storage.get("mykey")) # returns None if the key is missing
# delete an item
storage.remove("mykey")
# iterate items (storage.items() returns a plain dict, so call .items() on it)
for key, value in storage.items().items():
    print("%s: %s" % (key, value))
# delete items
storage.clear()
To list all installed Python packages
As of version 1.3 of pip you can now use
pip list
Using help function
help("modules")
using python-pip
pip freeze
pip freeze will output a list of installed packages and their versions.
It also allows you to write those packages to a file that can later be used to set up a new environment.
20 Python libraries you can't live without
Requests Scrapy wxPython Pillow SQLAlchemy BeautifulSoup Twisted
NumPy SciPy matplotlib Pygame Pyglet pyQT pyGtk Scapy
pywin32 nltk nose SymPy IPython
1. Requests. The de facto standard HTTP library.
2. Scrapy. A must-have library for web scraping.
3. wxPython. A GUI toolkit for Python. I have primarily used it in place of tkinter.
4. Pillow. A friendly fork of PIL (Python Imaging Library). It is more user-friendly than PIL and is a must-have for anyone who works with images.
5. SQLAlchemy. A database library. Many love it and many hate it.
6. BeautifulSoup. I know it's slow, but this XML and HTML parsing library is very useful for beginners.
7. Twisted. The most important tool for any network application developer. It has a very beautiful API.
8. NumPy. Provides advanced math functionality for Python.
9. SciPy. When we talk about NumPy, we have to talk about SciPy. It is a library of algorithms and mathematical tools for Python and has caused many scientists to switch from Ruby to Python.
10. matplotlib. A numerical plotting library. It is very useful for any data scientist or data analyst.
11. Pygame. Game development.
12. Pyglet. A 3D animation and game creation engine. This is the engine in which the famous Python port of Minecraft was made.
13. PyQt. A GUI toolkit for Python.
14. PyGTK. Another Python GUI library.
15. Scapy. A packet sniffer and analyzer for Python, made in Python.
16. pywin32. A Python library that provides useful methods and classes for interacting with Windows.
17. nltk. Natural Language Toolkit – I realize most people won't be using this one, but it's generic enough. It is a very useful library if you want to manipulate strings, and its capabilities go well beyond that. Do check it out.
18. nose. A testing framework for Python. It is used by millions of Python developers. It is a must-have if you do test-driven development.
19. SymPy. SymPy can do algebraic evaluation, differentiation, expansion, complex numbers, etc. It ships as a pure Python distribution.
20. IPython. A Python prompt on steroids. It has completion, history, shell capabilities, and a lot more. Make sure you take a look at it.
Installed Python packages:
IPython brain_curses lazy_object_proxy sqlite3
PdbSublimeTextSupport brain_dateutil lesscpy sre_compile
PyInstaller brain_fstrings lib2to3 sre_constants
PyQt5 brain_functools libfuturize sre_parse
Radiobutton brain_gi libpasteurize ssl
__future__ brain_hashlib linecache sspi
_ast brain_http lineedit sspicon
_asyncio brain_io locale stat
_asyncio_d brain_mechanize logging statistics
_bisect brain_multiprocessing lzma storemagic
_blake2 brain_namedtuple_enum macpath string
_bootlocale brain_nose macurl2path stringprep
_bz2 brain_numpy mailbox struct
_bz2_d brain_pkg_resources mailcap subprocess
_codecs brain_pytest markupsafe sunau
_codecs_cn brain_qt marshal symbol
_codecs_hk brain_random math sympy
_codecs_iso2022 brain_re matplotlib sympyprinting
_codecs_jp brain_six mccabe symtable
_codecs_kr brain_ssl mimetypes sys
_codecs_tw brain_subprocess mistune sysconfig
_collections brain_threading mmap tabnanny
_collections_abc brain_typing mmapfile tarfile
_compat_pickle brain_uuid mmsystem telnetlib
_compression builtins modulefinder tempfile
_csv bz2 more_itertools tensorflow
_ctypes cProfile mpmath terminado
_ctypes_d cachetools msilib test
_ctypes_test calendar msvcrt testpath
_ctypes_test_d certifi multiprocessing tests
_datetime cgi nbconvert textwrap
_decimal cgitb nbformat this
_decimal_d chardet netbios threading
_dummy_thread chunk netrc time
_elementtree click nntplib timeit
_elementtree_d cmath nose timer
_findvs cmd notebook tkinter
_functools code nt token
_hashlib codecs ntpath tokenize
_hashlib_d codeop ntsecuritycon toml
_heapq collections nturl2path tornado
_imp colorama numbers trace
_io colorsys numpy traceback
_json commctrl odbc tracemalloc
_locale compileall opcode traitlets
_lsprof concurrent operator tty
_lzma configparser optparse turtle
_lzma_d contextlib ordlookup turtledemo
_markupbase copy os typed_ast
_md5 copyreg pandas types
_msi crypt pandocfilters typing
_msi_d csv parser unicodedata
_multibytecodec ctypes parso unicodedata_d
_multiprocessing curses past unittest
_multiprocessing_d cycler pathlib uritemplate
_opcode cythonmagic pdb urllib
_operator datetime pefile urllib3
_osx_support dateutil perfmon uu
_overlapped dbi peutils uuid
_overlapped_d dbm pickle venv
_pickle dde pickleshare warnings
_pydecimal decimal pickletools wave
_pyio decorator pip wcwidth
_pyrsistent_version difflib pipes weakref
_random dis pkg_resources webbrowser
_sha1 distutils pkgutil webencodings
_sha256 doctest platform wheel
_sha3 docutils plistlib widgetsnbextension
_sha512 dotenv ply win2kras
_signal dummy_threading poplib win32api
_sitebuiltins easy_install posixpath win32clipboard
_socket email pprint win32com
_socket_d encodings profile win32con
_sqlite3 ensurepip progressbar win32console
_sqlite3_d entrypoints prometheus_client win32cred
_sre enum prompt_toolkit win32crypt
_ssl errno pstats win32cryptcon
_ssl_d external pty win32ctypes
_stat faulthandler py_compile win32event
_string filecmp pyasn1 win32evtlog
_strptime fileinput pyasn1_modules win32evtlogutil
_struct fnmatch pyclbr win32file
_symtable formatter pydoc win32gui
_testbuffer fractions pydoc_data win32gui_struct
_testbuffer_d ftplib pyexpat win32help
_testcapi functools pyexpat_d win32inet
_testcapi_d future pygments win32inetcon
_testconsole garden pylab win32job
_testconsole_d gc pylint win32lz
_testimportmultiple genericpath pymysql win32net
_testimportmultiple_d getopt pyparsing win32netcon
_testmultiphase getpass pyqt5_tools win32pdh
_testmultiphase_d gettext pyrsistent win32pdhquery
_thread glob pysrt win32pdhutil
_threading_local google_auth_httplib2 pythoncom win32pipe
_tkinter googleapiclient pytz win32print
_tkinter_d gzip pywin win32process
_tracemalloc hashlib pywin32_testutil win32profile
_warnings heapq pywintypes win32ras
_weakref hmac qtconsole win32rcparser
_weakrefset html queue win32security
_win32sysloader html5lib quopri win32service
_winapi http radian win32serviceutil
_winxptheme httplib2 random win32timezone
abc idlelib rasutil win32trace
adodbapi idna rchitect win32traceutil
afxres imaplib re win32transaction
aifc imghdr regcheck win32ts
altgraph imp regutil win32ui
antigravity importlib reprlib win32uiole
apiclient importlib_metadata requests win32verstamp
appdirs inspect rlcompleter win32wnet
argparse inventryList rmagic winerror
array io rsa winioctlcon
ast ipaddress runpy winnt
astroid ipykernel sched winperf
asynchat ipykernel_launcher scipy winpty
asyncio ipython_genutils secrets winreg
asyncore ipywidgets select winsound
atexit isapi select_d winsound_d
attr isort selectors winxpgui
audioop itertools selenium winxptheme
autoreload jedi send2trash wrapt
autosub jinja2 servicemanager wsgiref
base64 json setuptools xdrlib
bdb json5 shelve xml
binascii jsonschema shlex xmlrpc
binhex jupyter shutil xxsubtype
bisect jupyter_client signal zipapp
black jupyter_console simplegeneric zipfile
blackd jupyter_core site zipimport
bleach jupyterlab six zipp
blib2to3 jupyterlab_server smtpd zlib
brain_argparse jupyterthemes smtplib zmq
brain_attrs keyword sndhdr
brain_builtin_inference kivy socket
brain_collections kivy_deps socketserver wxpython
check version: python --version
first Django app
https://docs.djangoproject.com/en/3.0/intro/tutorial01/
Writing your first Django app, part 1
Check Django is installed
$ python -m django --version
Install Django
$ pip install Django
Create project
cd into a directory where you’d like to store your code
$ django-admin startproject mysite
startproject created:
mysite/
    manage.py
    mysite/
        __init__.py
        settings.py
        urls.py
        wsgi.py
manage.py: A command-line utility that lets you interact with this Django project in various ways.
You can read all the details about manage.py in django-admin and manage.py.
The inner mysite/ directory is the actual Python package for your project.
Its name is the Python package name you’ll need to use to import anything inside it (e.g. mysite.urls).
mysite/__init__.py: An empty file that tells Python that this directory should be considered a Python package.
If you’re a Python beginner, read more about packages in the official Python docs.
mysite/settings.py: Settings/configuration for this Django project.
Django settings will tell you all about how settings work.
mysite/urls.py: The URL declarations for this Django project; a “table of contents” of your Django-powered site.
You can read more about URLs in URL dispatcher.
mysite/wsgi.py: An entry-point for WSGI-compatible web servers to serve your project.
Change into the outer mysite directory and run the following commands:
$ python manage.py runserver
The Django development server starts.
Visit http://127.0.0.1:8000/ with your Web browser to see a “Congratulations!” page!
Changing the port
$ python manage.py runserver 8080
To listen on all available public IPs (which is useful if you are running Vagrant or want to show off your work on other computers on the network), use:
$ python manage.py runserver 0:8000
0 is a shortcut for 0.0.0.0.
To create an app, type this:
$ python manage.py startapp polls
Directory polls created:
polls/
    __init__.py
    admin.py
    apps.py
    migrations/
        __init__.py
    models.py
    tests.py
    views.py
Write the first view:
polls/views.py
from django.http import HttpResponse
def index(request):
    return HttpResponse("Hello, world. You're at the polls index.")
To call the view, we need to map it to a URL - and for this we need a URLconf.
To create a URLconf in the polls directory, create a file called urls.py.
polls/urls.py
from django.urls import path
from . import views
urlpatterns = [
    path('', views.index, name='index'),
]
The next step is to point the root URLconf at the polls.urls module.
In mysite/urls.py, add an import for django.urls.include and insert an include() in the urlpatterns list, so you have:
mysite/urls.py
from django.contrib import admin
from django.urls import include, path
urlpatterns = [
    path('polls/', include('polls.urls')),
    path('admin/', admin.site.urls),
]
It turns out I was confused because of the multiple directories named "mysite".
I wrongly created a urls.py file in the root "mysite" directory (which contains "manage.py"), then pasted in the code from the website.
To correct it I deleted this file, went into the mysite/mysite directory (which contains "settings.py"), modified the existing "urls.py" file, and replaced the code with the tutorial code.
Guessing from the limited information in the question, you might have forgotten to add the following import to your urls.py file:
from django.conf.urls import include
Logging in Python
# purpose of logging: record progress and problems
# 5 levels of logging: notset, debug, info, warning, error, critical
# 0 , 10, 20, 30, 40, 50
import logging
dir(logging) # check what is inside
import math
# create log format to show more details
# LOG_FORMAT = "%(Levelname)s %(asctime)s - %(message)s"
LOG_FORMAT = "%(levelname)s %(asctime)s - %(message)s"
# create and configure logger
logging.basicConfig(filename="C:\\Users\\User\\Desktop\\logfile.txt",
                    level=logging.DEBUG,  # the minimum level to record
                    format=LOG_FORMAT,    # the output message format
                    filemode="w")         # start with a blank file
logger = logging.getLogger() # create logger object
# test the logger
# logger.debug("harmless message.")
# logger.info("just some message.")
# logger.warning("warning message.")
# logger.error("error message.")
# logger.critical("thecritical message.")
# print(logger.level)
def quadratic(a, b, c):
    """Return the two solutions of the quadratic equation ax^2 + bx + c = 0."""
    logger.info("quadratic({0},{1},{2})".format(a, b, c))
    # compute the discriminant
    logger.debug("# compute the discriminant")
    disc = b**2 - 4*a*c
    # compute the two roots
    logger.debug("# compute the two roots")
    root1 = (-b + math.sqrt(disc)) / (2*a)
    root2 = (-b - math.sqrt(disc)) / (2*a)
    # return the roots
    logger.debug("# return the roots")
    return (root1, root2)

roots = quadratic(1, 0, -4)  # c must keep the discriminant non-negative, or math.sqrt raises
print(roots)
The New Year holiday is over, and we're all back at our jobs.
A new year, a fresh start: in this article I'd like to share 30 of the best Python practices, tips, and tricks, in the hope that they help my fellow hard-working programmers. Happy coding!
1. Python version
A reminder: as of January 1, 2020, Python 2 is no longer officially supported.
Many of the examples in this article only run under Python 3.
If you are still on Python 2.7, upgrade now.
2. Check the minimum Python version
You can check the Python version in your code, to make sure your users aren't running the script with an incompatible version.
Check it like this:
import sys

if not sys.version_info > (2, 7):
    # berate your user for running a 10-year-old
    # python version
    sys.exit("Python 2 is no longer supported; please upgrade")
elif not sys.version_info >= (3, 5):
    # kindly tell your user (s)he needs to upgrade
    # because you're using 3.5 features
    sys.exit("This script requires Python 3.5 or newer")
3. IPython
IPython is essentially an enhanced shell.
The auto-completion alone makes it worth a try, but it does much more; it also has many magic commands I can't live without, for example:
%cd: change the current working directory
%edit: open an editor, and execute the code you typed once you close it
%env: show the current environment variables
%pip install [pkgs]: install packages without leaving the interactive shell
%time and %timeit: measure the execution time of Python code
The complete list of commands can be found here (https://ipython.readthedocs.io/en/stable/interactive/magics.html).
Another very useful feature: referencing the output of a previous command.
In and Out are actual objects.
You can use the output of the third command with Out[3].
Install IPython with:
pip3 install ipython
4. List comprehensions
List comprehensions let you avoid the tedium of filling a list with a loop.
The basic syntax is:
[ expression for item in list if conditional ]
A basic example: filling a list with a sequence of numbers:
mylist = [i for i in range(10)]
print(mylist)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Since you can use an expression, you can also do some arithmetic:
squares = [x**2 for x in range(10)]
print(squares)
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
You can even call an external function:
def some_function(a):
    return (a + 5) / 2

my_formula = [some_function(i) for i in range(10)]
print(my_formula)
# [2, 3, 3, 4, 4, 5, 5, 6, 6, 7]
Finally, you can use 'if' to filter the list.
In the following example, we only keep the numbers that are divisible by 2:
filtered = [i for i in range(20) if i%2==0]
print(filtered)
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
5. Check the memory usage of an object
You can check an object's memory usage with sys.getsizeof():
import sys
mylist = range(0, 10000)
print(sys.getsizeof(mylist))
# 48
Wait — why does this huge list take up only 48 bytes?
Because here the range function returns a class that merely behaves like a list.
A range is far more memory-efficient than an actual list of numbers.
You can see for yourself by using a list comprehension to create an actual list of numbers over the same range:
import sys
myreallist = [x for x in range(0, 10000)]
print(sys.getsizeof(myreallist))
# 87632
6. Return multiple values
Functions in Python can return more than one variable, without needing a dictionary, list, or class to do so.
It works like this:
def get_user(id):
    # fetch user from database
    # ....
    return name, birthdate

name, birthdate = get_user(4)
That's fine for a limited number of return values.
But anything past 3 values should be put into a (data) class instead.
7. Use data classes
Python has offered data classes since version 3.7.
Compared with regular classes or other alternatives, like returning multiple values or dictionaries, data classes have several clear advantages:
a data class requires less code
you can compare data classes, because they provide an __eq__ method
you can easily print a data class for debugging, because it also provides a __repr__ method
data classes require type hints, reducing the chance of bugs
Here's an example of a data class:
from dataclasses import dataclass

@dataclass
class Card:
    rank: str
    suit: str
card = Card("Q", "hearts")
print(card == card)
# True
print(card.rank)
# 'Q'
print(card)
# Card(rank='Q', suit='hearts')
A detailed guide can be found here (https://realpython.com/python-data-classes/).
8. Swap variables
This neat little trick can save you multiple lines of code:
a = 1
b = 2
a, b = b, a
print(a)
# 2
print(b)
# 1
9. Merge dictionaries (Python 3.5+)
Since Python 3.5, merging dictionaries has become easier:
dict1 = { 'a': 1, 'b': 2 }
dict2 = { 'b': 3, 'c': 4 }
merged = { **dict1, **dict2 }
print(merged)
# {'a': 1, 'b': 3, 'c': 4}
If there are overlapping keys, the keys from the first dictionary get overwritten.
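Since Python 3.9 there is also a dedicated merge operator, |, with the same "last one wins" behavior for duplicate keys:

```python
dict1 = {'a': 1, 'b': 2}
dict2 = {'b': 3, 'c': 4}

# Python 3.9+: the right-hand operand wins on duplicate keys
merged = dict1 | dict2
print(merged)
# {'a': 1, 'b': 3, 'c': 4}
```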
10. Title-case a string
The following trick really is a cute one:
mystring = "10 awesome python tricks"
print(mystring.title())
# '10 Awesome Python Tricks'
11. Split a string into a list
You can split a string into a list of strings.
In the following example, we split on spaces to get the individual words:
mystring = "The quick brown fox"
mylist = mystring.split(' ')
print(mylist)
# ['The', 'quick', 'brown', 'fox']
12. Create a string from a list of strings
The reverse of the previous trick: we create a string from a list of strings, putting a space between each word:
mylist = ['The', 'quick', 'brown', 'fox']
mystring = " ".join(mylist)
print(mystring)
# 'The quick brown fox'
You might ask why it isn't mylist.join(" ") — good question!
The underlying reason is that the String.join() function can join not only lists but any iterable.
Putting it inside String avoids implementing the same functionality in multiple places.
13. Emoji
Some people love emoji, others loathe them.
For the record: emoji can come in very handy when analyzing social media data.
First, install the emoji module:
pip3 install emoji
Once installed, you can use it as follows:
import emoji
result = emoji.emojize('Python is :thumbs_up:')
print(result)
# 'Python is 👍'
# You can also reverse this:
result = emoji.demojize('Python is 👍')
print(result)
# 'Python is :thumbs_up:'
For more emoji examples and documentation, see here (https://pypi.org/project/emoji/).
14. List slicing
The basic syntax of list slicing is:
a[start:stop:step]
start, stop, and step are all optional.
If you leave them out, the following defaults are used:
start: 0
end: the end of the string
step: 1
Some examples:
# We can easily create a new list from
# the first two elements of a list:
first_two = [1, 2, 3, 4, 5][0:2]
print(first_two)
# [1, 2]
# And if we use a step value of 2,
# we can skip over every second number
# like this:
steps = [1, 2, 3, 4, 5][0:5:2]
print(steps)
# [1, 3, 5]
# This works on strings too. In Python,
# you can treat a string like a list of
# letters:
mystring = "abcdefdn nimt"[::2]
print(mystring)
# 'aced it'
15. Reverse strings and lists
You can use the slicing method above to reverse a string or a list.
Just set the step to -1 to reverse the elements:
revstring = "abcdefg"[::-1]
print(revstring)
# 'gfedcba'
revarray = [1, 2, 3, 4, 5][::-1]
print(revarray)
# [5, 4, 3, 2, 1]
16. Show some cats
I finally found a good excuse to show cats in one of my articles, hooray! Of course, you can also use this to show any other image.
First, install Pillow, a fork of the Python Imaging Library:
pip3 install Pillow
Next, download the following image to a file named kittens.jpg:
Then you can show the image with the following Python code:
from PIL import Image
im = Image.open("kittens.jpg")
im.show()
print(im.format, im.size, im.mode)
# JPEG (1920, 1357) RGB
Pillow can do a lot more than just display an image.
It can analyze, resize, filter, enhance, deform images, and so on.
For the complete documentation, see here (https://pillow.readthedocs.io/en/stable/).
17. map()
Python has a built-in function called map(), with the following syntax:
map(function, something_iterable)
So you give it a function to execute and something to execute it on.
Any iterable will do.
In the following example, I pass in a list:
def upper(s):
    return s.upper()

mylist = list(map(upper, ['sentence', 'fragment']))
print(mylist)
# ['SENTENCE', 'FRAGMENT']
# Convert a string representation of
# a number into a list of ints.
list_of_ints = list(map(int, "1234567"))
print(list_of_ints)
# [1, 2, 3, 4, 5, 6, 7]
Take a close look at your own code and see if you can replace a loop somewhere with map().
18. Get unique elements from a list or string
By creating a set with the set() function, you get all the unique elements of a list or list-like object:
mylist = [1, 1, 2, 3, 4, 5, 5, 5, 6, 6]
print(set(mylist))
# {1, 2, 3, 4, 5, 6}
# And since a string can be treated like a
# list of letters, you can also get the
# unique letters from a string this way:
print(set("aaabbbcccdddeeefff"))
# {'a', 'b', 'c', 'd', 'e', 'f'}
19. Find the most frequent value
You can find the most frequently occurring value like this:
test = [1, 2, 3, 4, 2, 2, 3, 1, 4, 4, 4]
print(max(set(test), key = test.count))
# 4
Can you follow that code? Try to figure it out before reading on.
Didn't get it? Let me walk you through it:
max() returns the maximum value in a list.
The key argument takes a function that customizes the ordering, in this case test.count.
That function is applied to every item of the iterable.
test.count is a built-in method of list.
It takes an argument and counts how many times that argument occurs.
So test.count(1) returns 2 and test.count(4) returns 4.
set(test) returns all the unique values of test, i.e. {1, 2, 3, 4}.
So what this single line of code does is: first take all the unique values of test, i.e. {1, 2, 3, 4}; then, for each of them, max runs list.count and returns the maximum.
This one-liner is not my own invention, by the way.
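The same question can also be answered with Counter from the collections module, which arguably reads more clearly and exposes the count as well:

```python
from collections import Counter

test = [1, 2, 3, 4, 2, 2, 3, 1, 4, 4, 4]
# most_common(1) returns a list with the single (value, count) pair
most_common_value, count = Counter(test).most_common(1)[0]
print(most_common_value, count)
# 4 4
```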
20. Create a progress bar
You can create your own progress bar, which sounds like fun.
But it's easier to use the progress package:
pip3 install progress
Now you can create a progress bar with minimal effort:
from progress.bar import Bar
bar = Bar('Processing', max=20)
for i in range(20):
    # Do some work
    bar.next()
bar.finish()
21. Using _ (the underscore operator) in an interactive shell
You can get the result of the last expression with the underscore operator. In IPython, for example, it works like this:
In [1]: 3 * 3
Out[1]: 9
In [2]: _ + 3
Out[2]: 12
The regular Python shell supports this as well.
In addition, in the IPython shell you can use Out[n] to get the value of the expression In[n].
In the example above, Out[1] would give the number 9.
22. Quickly creating a web server
You can quickly start a web server that serves the contents of the current directory:
python3 -m http.server
Consider this whenever you want to share a file with a colleague or test a simple HTML site.
23. Multi-line strings
Although you can enclose multi-line strings in your code in triple quotes, that approach is not ideal.
Everything placed between the triple quotes becomes part of the string, including the code's indentation, as shown below.
I prefer an alternative that joins multiple single-line strings together and keeps the code tidy.
The only drawback is that you need to insert the newlines explicitly.
s1 = """Multi line strings can be put
between triple quotes. It's not ideal
when formatting your code though"""
print(s1)
# Multi line strings can be put
# between triple quotes. It's not ideal
# when formatting your code though
s2 = ("You can also concatenate multiple\n" +
"strings this way, but you'll have to\n"
"explicitly put in the newlines")
print(s2)
# You can also concatenate multiple
# strings this way, but you'll have to
# explicitly put in the newlines
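If you do want triple quotes inside an indented block without the leading whitespace ending up in the string, the standard library's textwrap.dedent() strips the common indentation (a small sketch; the function name describe is made up):

```python
import textwrap

def describe():
    # The backslash after the opening quotes suppresses the
    # leading newline; dedent() removes the common indentation.
    s = """\
    Multi line strings can be put
    between triple quotes."""
    return textwrap.dedent(s)

print(describe())
# Multi line strings can be put
# between triple quotes.
```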
24. The ternary operator for conditional assignment
This makes your code more compact while keeping it readable:
[on_true] if [expression] else [on_false]
An example:
x = "Success!" if (y == 2) else "Failed!"
25. Counting element occurrences
You can use Counter from the collections library to get a dictionary with the counts of all the unique elements in a list:
from collections import Counter
mylist = [1, 1, 2, 3, 4, 5, 5, 5, 6, 6]
c = Counter(mylist)
print(c)
# Counter({1: 2, 2: 1, 3: 1, 4: 1, 5: 3, 6: 2})
# And it works on strings too:
print(Counter("aaaaabbbbbccccc"))
# Counter({'a': 5, 'b': 5, 'c': 5})
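Counter also has a most_common() method that returns (element, count) pairs sorted by count, which is handy when you only care about the top few:

```python
from collections import Counter

mylist = [1, 1, 2, 3, 4, 5, 5, 5, 6, 6]
c = Counter(mylist)

# most_common(1) returns the single most frequent element
# together with its count.
print(c.most_common(1))  # [(5, 3)]
```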
26. Chaining comparison operators
You can chain multiple comparison operators in Python, which makes for more readable and concise code:
x = 10
# Instead of:
if x > 5 and x < 15:
print("Yes")
# Yes
# You can also write:
if 5 < x < 15:
print("Yes")
# Yes
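Chaining is not limited to two comparisons, and the middle operand is evaluated only once: a < b < c behaves like a < b and b < c, except that b is computed a single time. A small sketch (the helper name middle is made up):

```python
calls = []

def middle():
    calls.append(1)  # record each evaluation
    return 10

# The chained form evaluates middle() exactly once.
result = 5 < middle() < 15
print(result)      # True
print(len(calls))  # 1
```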
27. Adding colors
With Colorama, you can add colors to your terminal output:
from colorama import Fore, Back, Style
print(Fore.RED + 'some red text')
print(Back.GREEN + 'and with a green background')
print(Style.DIM + 'and in dim text')
print(Style.RESET_ALL)
print('back to normal now')
28. Working with dates
The python-dateutil module provides powerful extensions to the standard datetime module. You can install it with:
pip3 install python-dateutil
You can do all kinds of magic with this library.
I'll limit myself to one example here: fuzzy parsing of dates from log lines:
from dateutil.parser import parse
logline = 'INFO 2020-01-01T00:00:01 Happy new year, human.'
timestamp = parse(logline, fuzzy=True)
print(timestamp)
# 2020-01-01 00:00:01
Just remember: when regular Python datetime functionality falls short, python-dateutil is worth a look!
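When the timestamp format is fixed rather than fuzzy, the standard library's datetime.strptime() is enough and needs no third-party install (this sketch reuses the same log line):

```python
from datetime import datetime

logline = 'INFO 2020-01-01T00:00:01 Happy new year, human.'

# Pull out the second whitespace-separated field and parse it
# with an explicit format string.
stamp = logline.split()[1]
timestamp = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
print(timestamp)  # 2020-01-01 00:00:01
```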
29. Integer division
In Python 2, the division operator (/) defaults to integer division, unless one of the operands is a float.
So you can write:
# Python 2
5 / 2    # 2
5 / 2.0  # 2.5
In Python 3, the division operator (/) defaults to float division, and // performs integer division.
So you need to write:
# Python 3
5 / 2    # 2.5
5 // 2   # 2
For the motivation behind this change, see PEP-0238 (https://www.python.org/dev/peps/pep-0238/).
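Note that // floors toward negative infinity rather than truncating toward zero, and divmod() gives you the quotient and remainder in one call:

```python
print(7 // 2)        # 3
print(-7 // 2)       # -4 (floored, not truncated toward zero)
print(divmod(7, 2))  # (3, 1): quotient and remainder together
```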
30. Detecting character sets with chardet
You can use the chardet module to detect the character set of a file.
This is very useful when analyzing large piles of random text.
Install it with:
pip install chardet
Once installed, you also have the command-line tool chardetect, which you can use like this:
chardetect somefile.txt
somefile.txt: ascii with confidence 1.0
You can also use the library programmatically; the complete documentation is here: https://chardet.readthedocs.io/en/latest/usage.html.
Python Creating a Menu
def menu():
    print("Welcome!\n Option 1\n Option 2\n Option 3\n")
    choice = input()
    if choice == "1":
        print("Option 1")
        menu()
    elif choice == "2":
        print("Option 2")
        menu()
    elif choice == "3":
        print("Option 3")
        menu()
menu()
Python Lambda Functions
Lambdas, also known as anonymous functions, are small, restricted functions which do not need a name (i.e., an identifier).
Today, many modern programming languages like Java, Python, C#, and C++ support lambda functions to add functionality to the languages.
Syntax and Examples
lambda arguments : expression
lambda p1, p2: expression
x = lambda a: a + 10
print(x(5))
# 15
adder = lambda x, y: x + y
print(adder(1, 2))
# 3
# A REGULAR FUNCTION
def guru(funct, *args):
    funct(*args)

def printer_one(arg):
    return print(arg)

def printer_two(arg):
    print(arg)

# CALL A REGULAR FUNCTION
guru(printer_one, 'printer 1 REGULAR CALL')
guru(printer_two, 'printer 2 REGULAR CALL \n')

# CALL A REGULAR FUNCTION THROUGH A LAMBDA
guru(lambda: printer_one('printer 1 LAMBDA CALL'))
guru(lambda: printer_two('printer 2 LAMBDA CALL'))
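The most common real-world use of a lambda is as a throwaway key function, for example with sorted() (the sample data here is made up):

```python
people = [("Ada", 36), ("Grace", 45), ("Alan", 41)]

# Sort by the second tuple element (the age) using a lambda key.
by_age = sorted(people, key=lambda person: person[1])
print(by_age)  # [('Ada', 36), ('Alan', 41), ('Grace', 45)]
```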
mysql
import mysql.connector
db = mysql.connector.connect(
host="localhost",
user="root",
passwd="asdf1234",
database="demo"
)
mycursor = db.cursor()
#mycursor.execute("CREATE TABLE urlTable (titleName varchar(50), urlAddr varchar(100), id int PRIMARY KEY )")
mycursor.execute("INSERT INTO urlTable (titleName, urlAddr) VALUES (%s, %s)", ('google', 'google.com'))
db.commit()
subprocess module
The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. This module intends to replace several older modules and functions:
os.system
os.spawn*
The recommended approach to invoking subprocesses is to use the run() function for all use cases it can handle. For more advanced use cases, the underlying Popen interface can be used directly.
subprocess.run(args, *, stdin=None, input=None, stdout=None, stderr=None, capture_output=False, shell=False, cwd=None, timeout=None, check=False, encoding=None, errors=None, text=None, env=None, universal_newlines=None)
Examples:
>>> subprocess.run(["ls", "-l"]) # doesn't capture output
>>> subprocess.run("exit 1", shell=True, check=True)
>>> subprocess.run(["ls", "-l", "/dev/null"], capture_output=True)
CompletedProcess(args=['ls', '-l', '/dev/null'], returncode=0,
stdout=b'crw-rw-rw- 1 root root 1, 3 Jan 23 16:23 /dev/null\n', stderr=b'')
Popen Constructor
Execute a child program in a new process.
example of passing some arguments to an external program as a sequence:
Popen(["/usr/bin/git", "commit", "-m", "Fixes a bug."])
An example of breaking a shell command into a sequence of arguments; shlex.split() illustrates how to determine the correct tokenization for args:
>>> import shlex, subprocess
>>> command_line = input()
/bin/vikings -input eggs.txt -output "spam spam.txt" -cmd "echo '$MONEY'"
>>> args = shlex.split(command_line)
>>> print(args)
['/bin/vikings', '-input', 'eggs.txt', '-output', 'spam spam.txt', '-cmd', "echo '$MONEY'"]
>>> p = subprocess.Popen(args) # Success!
SVG drawings
python svgwrite
A Python library to create SVG drawings.
a simple example:
import svgwrite
dwg = svgwrite.Drawing('test.svg', profile='tiny')
dwg.add(dwg.line((0, 0), (10, 0), stroke=svgwrite.rgb(10, 10, 16, '%')))
dwg.add(dwg.text('Test', insert=(0, 0.2), fill='red'))
dwg.save()
As the name svgwrite implies, svgwrite creates new SVG drawings, it does not read existing drawings and also does not import existing drawings, but you can always include other SVG drawings by the <image> entity.
Installation
with pip:
pip install svgwrite
or from source:
python setup.py install
Documentation
http://readthedocs.org/docs/svgwrite/
svgwrite can be found on GitHub.com at:
http://github.com/mozman/svgwrite.git
To access the browser's local storage when scraping a page, we need to invoke both a browser instance and leverage a JavaScript interpreter to read the local storage.
For my money, Selenium is the best solution.
A possible replacement for Selenium is PhantomJS, running a headless browser.
JavaScript to iterate over the localStorage browser object
for (var i = 0; i < localStorage.length; i++){
key=localStorage.key(i);
console.log(key+': '+localStorage.getItem(key));
}
Advanced script
As mentioned here, an HTML5-capable browser should also implement Array.prototype.map.
So script would be:
Array.apply(0, new Array(localStorage.length)).map(function (o, i)
{ return localStorage.key(i)+':'+localStorage.getItem(localStorage.key(i)); }
)
Python with Selenium script for setting up and scraping local storage
from selenium import webdriver
driver = webdriver.Firefox()
url='http://www.w3schools.com/'
driver.get(url)
scriptArray="""localStorage.setItem("key1", 'new item');
localStorage.setItem("key2", 'second item');
return Array.apply(0, new Array(localStorage.length)).map(function (o, i) { return localStorage.getItem(localStorage.key(i)); }
)"""
result = driver.execute_script(scriptArray)
print(result)
Python bindings alternative to Python+Selenium
Some might argue Selenium is inefficient for only local storage extracting.
If you think Selenium is too bulky, you might want to try a Python binding with a development framework for desktop, ex.
PyQt.
Something I might touch on in a later post.
Running Python in the web browser has been getting a lot of attention lately.
Shaun Taylor-Morgan knows what he's talking about here – he works for Anvil, a full-featured application platform for writing full-stack web apps with nothing but Python.
So I invited him to give us an overview and comparison of the open-source solutions for running Python code in your web browser.
In the past, if you wanted to build a web UI, your only choice was JavaScript.
That's no longer true.
There are quite a few ways to run Python in your web browser.
This is a survey of what's available.
I'm looking at six systems that all take a different approach to the problem.
Here's a diagram that sums up their differences.
The x-axis answers the question: when does Python get compiled? At one extreme, you run a command-line script to compile Python yourself.
At the other extreme, the compilation gets done in the user's browser as they write Python code.
The y-axis answers the question: what does Python get compiled to? Three systems make a direct conversion between the Python you write and some equivalent JavaScript.
The other three actually run a live Python interpreter in your browser, each in a slightly different way.
1. TRANSCRYPT
Transcrypt gives you a command-line tool you can run to compile a Python script into a JavaScript file.
You interact with the page structure (the DOM) using a toolbox of specialized Python objects and functions.
For example, if you import document, you can find any object on the page by using document like a dictionary.
To get the element whose ID is name-box, you would use document["name-box"].
Any readers familiar with JQuery will be feeling very at home.
Here's a basic example.
I wrote a Hello, World page with just an input box and a button:
<input id="name-box" placeholder="Enter your name">
<button id="greet-button">Say Hello</button>
To make it do something, I wrote some Python.
When you click the button, an event handler fires that displays an alert with a greeting:
def greet():
alert("Hello " + document.getElementById("name-box").value + "!")
document.getElementById("greet-button").addEventListener('click', greet)
I wrote this in a file called hello.py and compiled it using transcrypt hello.py.
The compiler spat out a JavaScript version of my file, called hello.js.
Transcrypt makes the conversion to JavaScript at the earliest possible time – before the browser is even running.
Next we'll look at Brython, which makes the conversion on page load.
2. BRYTHON
Brython lets you write Python in script tags in exactly the same way you write JavaScript.
Just as with Transcrypt, it has a document object for interacting with the DOM.
The same widget I wrote above can be written in a script tag like this:
<script type="text/python">
from browser import document, alert
def greet(event):
alert("Hello " + document["name-box"].value + "!")
document["greet-button"].bind("click", greet)
</script>
Pretty cool, huh? A script tag whose type is text/python!
There's a good explanation of how it works on the Brython GitHub page.
In short, you run a function when your page loads:
<body onload="brython()">
that transpiles anything it finds in a Python script tag:
<script type="text/python"></script>
which results in some machine-generated JavaScript that it runs using JS's eval() function.
3. SKULPT
Skulpt sits at the far end of our diagram – it compiles Python to JavaScript at runtime.
This means the Python doesn't have to be written until after the page has loaded.
The Skulpt website has a Python REPL that runs in your browser.
It's not making requests back to a Python interpreter on a server somewhere, it's actually running on your machine.
Skulpt does not have a built-in way to interact with the DOM.
This can be an advantage, because you can build your own DOM manipulation system depending on what you're trying to achieve.
More on this later.
Skulpt was originally created to produce educational tools that need a live Python session on a web page (example: Trinket.io).
While Transcrypt and Brython are designed as direct replacements for JavaScript, Skulpt is more suited to building Python programming environments on the web (such as the full-stack app platform, Anvil).
We've reached the end of the x-axis in our diagram.
Next we head in the vertical direction: our final three technologies don't compile Python to JavaScript, they actually implement a Python runtime in the web browser.
4. PYPY.JS
PyPy.js is a JavaScript implementation of a Python interpreter.
The developers took a C-to-JavaScript compiler called emscripten and ran it on the source code of PyPy.
The result is PyPy, but running in your browser.
Advantages: It's a very faithful implementation of Python, and code gets executed quickly.
Disadvantages: A web page that embeds PyPy.js contains an entire Python interpreter, so it's pretty big as web pages go (think megabytes).
You import the interpreter using <script> tags, and you get an object called pypyjs in the global JS scope.
There are three main functions for interacting with the interpreter.
To execute some Python, run pypyjs.exec(<python code>).
To pass values between JavaScript and Python, use pypyjs.set(variable, value) and pypyjs.get(variable).
Here's a script that uses PyPy.js to calculate the first ten square numbers:
<script type="text/javascript">
pypyjs.exec(
// Run some Python
'y = [x**2 for x in range(10)]'
).then(function() {
// Transfer the value of y from Python to JavaScript
pypyjs.get('y')
}).then(function(result) {
// Display an alert box with the value of y in it
alert(result)
});
</script>
PyPy.js has a few features that make it feel like a native Python environment – there's even an in-memory filesystem so you can read and write files.
There's also a document object that gives you access to the DOM from Python.
The project has a great readme if you're interested in learning more.
5. BATAVIA
Batavia is a bit like PyPy.js, but it runs bytecode rather than Python.
Here's a Hello, World script written in Batavia:
<script id="batavia-helloworld" type="application/python-bytecode">
7gwNCkIUE1cWAAAA4wAAAAAAAAAAAAAAAAIAAABAAAAAcw4AAABlAABkAACDAQABZAEAUykCegtI
ZWxsbyBXb3JsZE4pAdoFcHJpbnSpAHICAAAAcgIAAAD6PC92YXIvZm9sZGVycy85cC9uenY0MGxf
OTc0ZGRocDFoZnJjY2JwdzgwMDAwZ24vVC90bXB4amMzZXJyddoIPG1vZHVsZT4BAAAAcwAAAAA=
</script>
Bytecode is the ‘assembly language' of the Python virtual machine – if you've ever looked at the .pyc files Python generates, that's what they contain (Yasoob dug into some bytecode in a recent post on this blog).
This example doesn't look like assembly language because it's base64-encoded.
Batavia is potentially faster than PyPy.js, since it doesn't have to compile your Python to bytecode.
It also makes the download smaller – around 400kB.
The disadvantage is that your code needs to be written and compiled in a native (non-browser) environment, as was the case with Transcrypt.
Again, Batavia lets you manipulate the DOM using a Python module it provides (in this case it's called dom).
The Batavia project is quite promising because it fills an otherwise unfilled niche – ahead-of-time compiled Python in the browser that runs in a full Python VM.
Unfortunately, the GitHub repo's commit rate seems to have slowed in the past year or so.
If you're interested in helping out, here's their developer guide.
6. PYODIDE
Mozilla's Pyodide was announced in April 2019.
It solves a difficult problem: interactive data visualisation in Python, in the browser.
Python has become a favourite language for data science thanks to libraries such as NumPy, SciPy, Matplotlib and Pandas.
We already have Jupyter Notebooks, which are a great way to present a data pipeline online, but they must be hosted on a server somewhere.
If you can put the data processing on the user's machine, they avoid the round-trip to your server so real-time visualisation is more powerful.
And you can scale to so many more users if their own machines are providing the compute.
It's easier said than done.
Fortunately, the Mozilla team came across a version of the reference Python implementation (CPython) that was compiled into WebAssembly.
WebAssembly is a low-level complement to JavaScript that performs closer to native speeds, which opens the browser up for performance-critical applications like this.
Mozilla took charge of the WebAssembly CPython project and recompiled NumPy, SciPy, Matplotlib and Pandas into WebAssembly too.
The result is a lot like Jupyter Notebooks in the browser – here's an introductory notebook.
It's an even bigger download than PyPy.js (that example is around 50MB), but as Mozilla point out, a good browser will cache that for you.
And for a data processing notebook, waiting a few seconds for the page to load is not a problem.
You can write HTML, Markdown and JavaScript in Pyodide Notebooks too.
And yes, there's a document object to access the DOM.
It's a really promising project!
MAKING A CHOICE
I've given you six different ways to write Python in the browser, and you might be able to find more.
Which one to choose? This summary table may help you decide.
There's a more general point here too: the fact that there is a choice.
As a web developer, it often feels like you have to write JavaScript, you have to build an HTTP API, you have to write SQL and HTML and CSS.
The six systems we've looked at make JavaScript seem more like a language that gets compiled to, and you choose what to compile to it (And WebAssembly is actually designed to be used this way).
Why not treat the whole web stack this way? The future of web development is to move beyond the technologies that we've always ‘had' to use.
The future is to build abstractions on top of those technologies, to reduce the unnecessary complexity and optimise developer efficiency.
That's why Python itself is so popular – it's a language that puts developer efficiency first.
ONE UNIFIED SYSTEM
There should be one way to represent data, from the database all the way to the UI.
Since we're Pythonistas, we'd like everything to be a Python object, not an SQL SELECT statement followed by a Python object followed by JSON followed by a JavaScript object followed by a DOM element.
That's what Anvil does – it's a full-stack Python environment that abstracts away the complexity of the web. Here's a 7-minute video that covers how it works.
Remember I said that it can be an advantage that Skulpt doesn't have a built-in way to interact with the DOM? This is why.
If you want to go beyond ‘Python in the browser' and build a fully-integrated Python environment, your abstraction of the User Interface needs to fit in with your overall abstraction of the web system.
So Python in the browser is just the start of something bigger.
I like to live dangerously, so I'm going to make a prediction.
In 5 years' time, more than 50% of web apps will be built with tools that sit one abstraction level higher than JavaScript frameworks such as React and Angular.
It has already happened for static sites: most people who want a static site will use WordPress or Wix rather than firing up a text editor and writing HTML.
As systems mature, they become unified and the amount of incidental complexity gradually minimises.
Brython tutorial
This tutorial explains how to develop an application that runs in the browser using the Python programming language.
We will take the example of writing a calculator.
You will need a text editor, and of course a browser with an Internet access.
The contents of this tutorial assumes that you have at least a basic knowledge of HTML (general page structure, most usual tags), of stylesheets (CSS) and of the Python language.
In the text editor, create an HTML page with the following content:
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<script type="text/javascript"
src="https://cdn.jsdelivr.net/npm/brython@3.8.9/brython.min.js">
</script>
</head>
<body onload="brython()">
<script type="text/python">
from browser import document
document <= "Hello !"
</script>
</body>
</html>
In an empty directory, save this page as index.html.
To read it in the browser, you have two options:
use the File/Open menu: it is the most simple solution.
It brings some limitations for an advanced use, but it works perfectly for this tutorial
launch a web server : for instance, if the Python interpreter available from python.org is available on your machine, run python -m http.server in the file directory, then enter localhost:8000/index.html in the browser address bar
When you open the page, you should see the message "Hello !" printed on the browser window.
Page structure
Let's take a look at the page contents.
In the <head> zone we load the script brython.js : it is the Brython engine, the program that will find and execute the Python scripts included in the page.
In this example we get it from a CDN, so that there is nothing to install on the PC.
Note the version number (brython@3.8.9) : it can be updated for each new Brython version.
The <body> tag has an attribute onload="brython()".
It means that when the page has finished loading, the browser has to call the function brython(), which is defined in the Brython engine loaded in the page.
The function searches all the <script> tags that have the attribute type="text/python" and executes them.
Our index.html page embeds this script:
from browser import document
document <= "Hello !"
This is a standard Python program, starting with the import of a module, browser (in this case, a module shipped with the Brython engine brython.js).
The module has an attribute document which references the content displayed in the browser window.
To add a text to the document - concretely, to display a text in the browser - the syntax used by Brython is
document <= "Hello !"
You can think of the <= sign as a left arrow : the document "receives" a new element, here the string "Hello !".
You will see later that it is always possible to use the standardized DOM syntax to interact with the page, but Brython provides a few shortcuts to make the code less verbose.
Text formatting with HTML tags
HTML tags allow text formatting, for instance to write it in bold letters (<B> tag), in italic (<I>), etc.
With Brython, these tags are available as functions defined in module html of the browser package.
Here is how to use it:
from browser import document, html
document <= html.B("Hello !")
Tags can be nested:
document <= html.B(html.I("Hello !"))
Tags can also be added to each other, as well as strings:
document <= html.B("Hello, ") + "world !"
The first argument of a tag function can be a string, a number, another tag.
It can also be a Python "iterable" (list, comprehension, generator): in this case, all the elements produced in the iteration are added to the tag:
document <= html.UL(html.LI(i) for i in range(5))
Tag attributes are passed as keyword arguments to the function:
html.A("Brython", href="http://brython.info")
Drawing the calculator
We can draw our calculator as an HTML table.
The first line is made of the result zone, followed by a reset button.
The next 3 lines are the calculator keys: digits and operations.
from browser import document, html
calc = html.TABLE()
calc <= html.TR(html.TH(html.DIV("0", id="result"), colspan=3) +
html.TH("C", id="clear"))
lines = ["789/",
"456*",
"123-",
"0.=+"]
calc <= (html.TR(html.TD(x) for x in line) for line in lines)
document <= calc
Note the use of Python generators to reduce the program size, while keeping it readable.
Let's add style to the <TD> tags in a stylesheet so that the calculator looks better:
<style>
*{
font-family: sans-serif;
font-weight: normal;
font-size: 1.1em;
}
td{
background-color: #ccc;
padding: 10px 30px 10px 30px;
border-radius: 0.2em;
text-align: center;
cursor: default;
}
#result{
border-color: #000;
border-width: 1px;
border-style: solid;
padding: 10px 30px 10px 30px;
text-align: right;
}
</style>
Event handling
The next step is to trigger an action when the user presses the calculator keys:
for digits and operations: print the digit or operation in the result zone
for the = sign: execute the operation and print the result, or an error message if the input is invalid
for the C letter: reset the result zone
To handle the elements printed in the page, the program first needs to get a reference to them.
The buttons have been created as <TD> tags; to get a reference to all these tags, the syntax is
document.select("td")
The result of select() is always a list of elements.
The events that can occur on the elements of a page have a normalized name: when the user clicks on a button, the event called "click" is triggered.
In the program, this event will trigger the execution of a function.
The association between element, event and function is defined by the syntax
element.bind("click", action)
For the calculator, we can associate the same function to the "click" event on all buttons by:
for button in document.select("td"):
button.bind("click", action)
To comply with Python syntax, the function action() must be defined earlier in the program.
Such "callback" functions take a single parameter, an object that represents the event.
Complete program
Here is the code that manages a minimal version of the calculator.
The most important part is in the function action(event).
from browser import document, html
# Build the calculator
calc = html.TABLE()
calc <= html.TR(html.TH(html.DIV("0", id="result"), colspan=3) +
html.TD("C"))
lines = ["789/", "456*", "123-", "0.=+"]
calc <= (html.TR(html.TD(x) for x in line) for line in lines)
document <= calc
result = document["result"] # direct access to an element by its id
def action(event):
"""Handles the "click" event on a button of the calculator."""
# The element the user clicked on is the attribute "target" of the
# event object
element = event.target
# The text printed on the button is the element's "text" attribute
value = element.text
if value not in "=C":
# update the result zone
if result.text in ["0", "error"]:
result.text = value
else:
result.text = result.text + value
elif value == "C":
# reset
result.text = "0"
elif value == "=":
# execute the formula in result zone
try:
result.text = eval(result.text)
except:
result.text = "error"
# Associate function action() to the event "click" on all buttons
for button in document.select("td"):
button.bind("click", action)
Selenium with Python
from selenium import webdriver # note! this file name cannot be selenium.py because this is not the library
PATH = r"D:\Python36-32\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://williamkpchan.github.io/LibDocs/python%20notes.html")
#driver.close() # this closes the tab only if more than one tab is open in the browser
print(driver.title)
driver.quit()
Tech with Tim sample:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
PATH = r"Program Files\Chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://techwithtim.net")
print(driver.title)
search = driver.find_element_by_name("s")
search.send_keys("test")
search.send_keys(Keys.RETURN)
try:
main = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.ID, "main"))
)
articles = main.find_elements_by_tag_name("article")
for article in articles:
header = article.find_element_by_class_name("entry-summary")
print(header.text)
finally:
driver.quit()
Python call an external command
import subprocess
subprocess.run(["ls", "-l"])
import os
os.system("your command")
stream = os.popen("some_command with args")
subprocess.call(['ping', 'localhost'])
print(subprocess.Popen("echo Hello World", shell=True, stdout=subprocess.PIPE).stdout.read())
print(os.popen("echo Hello World").read())
return_code = subprocess.call("echo Hello World", shell=True)
print(subprocess.Popen("echo %s " % user_input, stdout=subprocess.PIPE).stdout.read())
import subprocess
p = subprocess.Popen('ls', shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
for line in p.stdout.readlines():
    print(line.decode(), end='')
retval = p.wait()
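In modern Python (3.5+), subprocess.run() with capture_output=True and text=True replaces most of the Popen/os.popen patterns above. This sketch runs the current interpreter as the child command so it stays portable:

```python
import subprocess
import sys

# Run a child Python process and capture its stdout as text.
result = subprocess.run(
    [sys.executable, "-c", "print('Hello World')"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # Hello World
print(result.returncode)      # 0
```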
Python code for common Matplotlib plots
# !pip install brewer2mpl
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.filterwarnings(action='once')
large = 22; med = 16; small = 12
params = {'axes.titlesize': large,
          'legend.fontsize': med,
          'figure.figsize': (16, 10),
          'axes.labelsize': med,
          'xtick.labelsize': med,
          'ytick.labelsize': med,
          'figure.titlesize': large}
plt.rcParams.update(params)
plt.style.use('seaborn-whitegrid')
sns.set_style("white")
%matplotlib inline
# Version
print(mpl.__version__) #> 3.0.0
print(sns.__version__) #> 0.9.0
1. Scatter plot
The scatter plot is a classic, fundamental plot for studying the relationship between two variables.
If there are multiple groups in the data, you may want to visualize each group in a different color.
This is easy to do in Matplotlib.
# Import dataset
midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")
# Prepare Data
# Create as many colors as there are unique midwest['category']
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]
# Draw Plot for Each Category
plt.figure(figsize=(16, 10), dpi= 80, facecolor='w', edgecolor='k')
for i, category in enumerate(categories):
plt.scatter('area', 'poptotal',
data=midwest.loc[midwest.category==category, :],
s=20, c=colors[i], label=str(category))
# Decorations
plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000),
xlabel='Area', ylabel='Population')
plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.title("Scatterplot of Midwest Area vs Population", fontsize=22)
plt.legend(fontsize=12)
plt.show()
2. Bubble plot with encircling
Sometimes you want to show a group of points within a boundary to emphasize their importance.
In this example, you take the records from the dataframe that should be encircled and pass them to the encircle() function described in the code below.
from matplotlib import patches
from scipy.spatial import ConvexHull
import warnings; warnings.simplefilter('ignore')
sns.set_style("white")
# Step 1: Prepare Data
midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")
# As many colors as there are unique midwest['category']
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]
# Step 2: Draw Scatterplot with unique color for each category
fig = plt.figure(figsize=(16, 10), dpi= 80, facecolor='w', edgecolor='k')
for i, category in enumerate(categories):
plt.scatter('area', 'poptotal', data=midwest.loc[midwest.category==category, :], s='dot_size', c=colors[i], label=str(category), edgecolors='black', linewidths=.5)
# Step 3: Encircling
# https://stackoverflow.com/questions/44575681/how-do-i-encircle-different-data-sets-in-scatter-plot
def encircle(x,y, ax=None, **kw):
if not ax: ax=plt.gca()
p = np.c_[x,y]
hull = ConvexHull(p)
poly = plt.Polygon(p[hull.vertices,:], **kw)
ax.add_patch(poly)
# Select data to be encircled
midwest_encircle_data = midwest.loc[midwest.state=='IN', :]
# Draw polygon surrounding vertices
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="k", fc="gold", alpha=0.1)
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="firebrick", fc="none", linewidth=1.5)
# Step 4: Decorations
plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000),
xlabel='Area', ylabel='Population')
plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.title("Bubble Plot with Encircling", fontsize=22)
plt.legend(fontsize=12)
plt.show()
3. Scatter plot with line of best fit
If you want to understand how two variables change with respect to each other, a line of best fit is the way to go.
The plot below shows how the line of best fit differs between the groups in the data.
To disable the grouping and draw just one best-fit line for the whole dataset, remove the hue parameter from the lmplot() call below.
# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_select = df.loc[df.cyl.isin([4,8]), :]
# Plot
sns.set_style("white")
gridobj = sns.lmplot(x="displ", y="hwy", hue="cyl", data=df_select,
height=7, aspect=1.6, robust=True, palette='tab10',
scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))
# Decorations
gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))
plt.title("Scatterplot with line of best fit grouped by number of cylinders", fontsize=20)
plt.show()
Each regression line in its own column
Alternatively, you can show the best fit line for each group in its own column.
You can do this by setting the col parameter inside sns.lmplot().
# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_select = df.loc[df.cyl.isin([4,8]), :]
# Each line in its own column
sns.set_style("white")
gridobj = sns.lmplot(x="displ", y="hwy",
data=df_select,
height=7,
robust=True,
palette='Set1',
col="cyl",
scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))
# Decorations
gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))
plt.show()
4. Jittering with Stripplot
Often, multiple data points have exactly the same X and Y values.
As a result, multiple points get plotted over each other and hide one another.
To avoid this, jitter the points slightly so you can visually see them.
This is convenient to do with seaborn's stripplot().
# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
# Draw Stripplot
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)
sns.stripplot(x=df.cty, y=df.hwy, jitter=0.25, size=8, ax=ax, linewidth=.5)
# Decorations
plt.title('Use jittered plots to avoid overlapping of points', fontsize=22)
plt.show()
5. Counts Plot
Another option to avoid the problem of points overlapping is to increase the size of each dot depending on how many points lie at that location.
So, the larger the size of a point, the greater the concentration of points around it.
# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_counts = df.groupby(['hwy', 'cty']).size().reset_index(name='counts')
# Draw Stripplot
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)
sns.stripplot(x=df_counts.cty, y=df_counts.hwy, size=df_counts.counts*2, ax=ax)
# Decorations
plt.title('Counts Plot - Size of circle is bigger as more points overlap', fontsize=22)
plt.show()
6. Marginal Histogram
A marginal histogram has histograms along the X and Y axis variables.
It is used to visualize the relationship between X and Y along with the univariate distribution of X and Y individually.
This plot is often used in exploratory data analysis (EDA).
# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
# Create Fig and gridspec
fig = plt.figure(figsize=(16, 10), dpi= 80)
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)
# Define the axes
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])
# Scatterplot on main ax
ax_main.scatter('displ', 'hwy', s=df.cty*4, c=df.manufacturer.astype('category').cat.codes, alpha=.9, data=df, cmap="tab10", edgecolors='gray', linewidths=.5)
# Histogram on the bottom
ax_bottom.hist(df.displ, 40, histtype='stepfilled', orientation='vertical', color='deeppink')
ax_bottom.invert_yaxis()
# Histogram on the right
ax_right.hist(df.hwy, 40, histtype='stepfilled', orientation='horizontal', color='deeppink')
# Decorations
ax_main.set(title='Scatterplot with Histograms\n displ vs hwy', xlabel='displ', ylabel='hwy')
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)
xlabels = ax_main.get_xticks().tolist()
ax_main.set_xticklabels(xlabels)
plt.show()
7. Marginal Boxplot
A marginal boxplot serves a similar purpose as a marginal histogram.
However, the boxplot helps pinpoint the median and the 25th and 75th percentiles of X and Y.
# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
# Create Fig and gridspec
fig = plt.figure(figsize=(16, 10), dpi= 80)
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)
# Define the axes
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])
# Scatterplot on main ax
ax_main.scatter('displ', 'hwy', s=df.cty*5, c=df.manufacturer.astype('category').cat.codes, alpha=.9, data=df, cmap="Set1", edgecolors='black', linewidths=.5)
# Add a graph in each part
sns.boxplot(y=df.hwy, ax=ax_right, orient="v")
sns.boxplot(x=df.displ, ax=ax_bottom, orient="h")
# Decorations ------------------
# Remove x axis name for the boxplot
ax_bottom.set(xlabel='')
ax_right.set(ylabel='')
# Main Title, Xlabel and YLabel
ax_main.set(title='Scatterplot with Marginal Boxplots\n displ vs hwy', xlabel='displ', ylabel='hwy')
# Set font size of different components
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)
plt.show()
8. Correlogram
A correlogram is used to visually see the correlation metric between all possible pairs of numeric variables in a given dataframe (or 2D array).
# Import Dataset
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
# Plot
plt.figure(figsize=(12,10), dpi= 80)
sns.heatmap(df.corr(), xticklabels=df.corr().columns, yticklabels=df.corr().columns, cmap='RdYlGn', center=0, annot=True)
# Decorations
plt.title('Correlogram of mtcars', fontsize=22)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
9. Pairwise Plot
Pairwise plots are a favorite in exploratory analysis for understanding the relationship between all possible pairs of numeric variables.
It is a must-have tool for bivariate analysis.
# Load Dataset
df = sns.load_dataset('iris')
# Plot
plt.figure(figsize=(10,8), dpi= 80)
sns.pairplot(df, kind="scatter", hue="species", plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))
plt.show()
# Load Dataset
df = sns.load_dataset('iris')
# Plot
plt.figure(figsize=(10,8), dpi= 80)
sns.pairplot(df, kind="reg", hue="species")
plt.show()
Deviation
10. Diverging Bars
If you want to see how items are varying based on a single metric and visualize the order and amount of this variance, diverging bars are a great tool.
They help to quickly differentiate the performance of groups in your data; they are quite intuitive and communicate this point immediately.
# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if x < 0 else 'green' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)
# Draw plot
plt.figure(figsize=(14,10), dpi= 80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=5)
# Decorations
plt.gca().set(ylabel='$Model$', xlabel='$Mileage$')
plt.yticks(df.index, df.cars, fontsize=12)
plt.title('Diverging Bars of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()
11. Diverging Texts
Diverging texts are similar to diverging bars, and are preferred if you want to show the value of each item within the chart in a nice and presentable way.
# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if x < 0 else 'green' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)
# Draw plot
plt.figure(figsize=(14,14), dpi= 80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z)
for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    t = plt.text(x, y, round(tex, 2), horizontalalignment='right' if x < 0 else 'left',
                 verticalalignment='center', fontdict={'color':'red' if x < 0 else 'green', 'size':14})
# Decorations
plt.yticks(df.index, df.cars, fontsize=12)
plt.title('Diverging Text Bars of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.xlim(-2.5, 2.5)
plt.show()
12. Diverging Dot Plot
A diverging dot plot is also similar to diverging bars.
However, compared to diverging bars, the absence of bars reduces the amount of contrast and disparity between the groups.
# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if x < 0 else 'darkgreen' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)
# Draw plot
plt.figure(figsize=(14,16), dpi= 80)
plt.scatter(df.mpg_z, df.index, s=450, alpha=.6, color=df.colors)
for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    t = plt.text(x, y, round(tex, 1), horizontalalignment='center',
                 verticalalignment='center', fontdict={'color':'white'})
# Decorations
# Lighten borders
plt.gca().spines["top"].set_alpha(.3)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(.3)
plt.gca().spines["left"].set_alpha(.3)
plt.yticks(df.index, df.cars)
plt.title('Diverging Dotplot of Car Mileage', fontdict={'size':20})
plt.xlabel('$Mileage$')
plt.grid(linestyle='--', alpha=0.5)
plt.xlim(-2.5, 2.5)
plt.show()
13. Diverging Lollipop Chart with Markers
A lollipop chart with markers provides a flexible way of visualizing the divergence by laying emphasis on any significant data points you want to bring attention to and giving reasoning within the chart appropriately.
# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = 'black'
# color fiat differently
df.loc[df.cars == 'Fiat X1-9', 'colors'] = 'darkorange'
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)
# Draw plot
import matplotlib.patches as patches
plt.figure(figsize=(14,16), dpi= 80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=1)
plt.scatter(df.mpg_z, df.index, color=df.colors, s=[600 if x == 'Fiat X1-9' else 300 for x in df.cars], alpha=0.6)
plt.yticks(df.index, df.cars)
plt.xticks(fontsize=12)
# Annotate
plt.annotate('Mercedes Models', xy=(0.0, 11.0), xytext=(1.0, 11), xycoords='data',
fontsize=15, ha='center', va='center',
bbox=dict(boxstyle='square', fc='firebrick'),
arrowprops=dict(arrowstyle='-[, widthB=2.0, lengthB=1.5', lw=2.0, color='steelblue'), color='white')
# Add Patches
p1 = patches.Rectangle((-2.0, -1), width=.3, height=3, alpha=.2, facecolor='red')
p2 = patches.Rectangle((1.5, 27), width=.8, height=5, alpha=.2, facecolor='green')
plt.gca().add_patch(p1)
plt.gca().add_patch(p2)
# Decorate
plt.title('Diverging Bars of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()
14. Area Chart
By coloring the area between the axis and the lines, an area chart emphasizes not just the peaks and troughs but also the duration of the highs and lows.
The longer the duration of the highs, the larger the area under the line.
import numpy as np
import pandas as pd
# Prepare Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/economics.csv", parse_dates=['date']).head(100)
x = np.arange(df.shape[0])
y_returns = (df.psavert.diff().fillna(0)/df.psavert.shift(1)).fillna(0) * 100
# Plot
plt.figure(figsize=(16,10), dpi= 80)
plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] >= 0, facecolor='green', interpolate=True, alpha=0.7)
plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] <= 0, facecolor='red', interpolate=True, alpha=0.7)
# Annotate
plt.annotate('Peak\n1975', xy=(94.0, 21.0), xytext=(88.0, 28),
bbox=dict(boxstyle='square', fc='firebrick'),
arrowprops=dict(facecolor='steelblue', shrink=0.05), fontsize=15, color='white')
# Decorations
xtickvals = [str(m)[:3].upper()+"-"+str(y) for y,m in zip(df.date.dt.year, df.date.dt.month_name())]
plt.gca().set_xticks(x[::6])
plt.gca().set_xticklabels(xtickvals[::6], rotation=90, fontdict={'horizontalalignment': 'center', 'verticalalignment': 'center_baseline'})
plt.ylim(-35,35)
plt.xlim(1,100)
plt.title("Month Economics Return %", fontsize=22)
plt.ylabel('Monthly returns %')
plt.grid(alpha=0.5)
plt.show()
15. Ordered Bar Chart
An ordered bar chart conveys the rank order of the items effectively.
But by adding the value of the metric above the chart, the user gets the precise information from the chart itself.
# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').apply(lambda x: x.mean())
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)
# Draw plot
import matplotlib.patches as patches
fig, ax = plt.subplots(figsize=(16,10), facecolor='white', dpi= 80)
ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=20)
# Annotate Text
for i, cty in enumerate(df.cty):
    ax.text(i, cty+0.5, round(cty, 1), horizontalalignment='center')
# Title, Label, Ticks and Ylim
ax.set_title('Bar Chart for Highway Mileage', fontdict={'size':22})
ax.set(ylabel='Miles Per Gallon', ylim=(0, 30))
plt.xticks(df.index, df.manufacturer.str.upper(), rotation=60, horizontalalignment='right', fontsize=12)
# Add patches to color the X axis labels
p1 = patches.Rectangle((.57, -0.005), width=.33, height=.13, alpha=.1, facecolor='green', transform=fig.transFigure)
p2 = patches.Rectangle((.124, -0.005), width=.446, height=.13, alpha=.1, facecolor='red', transform=fig.transFigure)
fig.add_artist(p1)
fig.add_artist(p2)
plt.show()
16. Lollipop Chart
A lollipop chart serves a similar purpose as an ordered bar chart in a visually pleasing way.
# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').apply(lambda x: x.mean())
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)
# Draw plot
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)
ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=2)
ax.scatter(x=df.index, y=df.cty, s=75, color='firebrick', alpha=0.7)
# Title, Label, Ticks and Ylim
ax.set_title('Lollipop Chart for Highway Mileage', fontdict={'size':22})
ax.set_ylabel('Miles Per Gallon')
ax.set_xticks(df.index)
ax.set_xticklabels(df.manufacturer.str.upper(), rotation=60, fontdict={'horizontalalignment': 'right', 'size':12})
ax.set_ylim(0, 30)
# Annotate
for row in df.itertuples():
    ax.text(row.Index, row.cty+.5, s=round(row.cty, 2), horizontalalignment='center', verticalalignment='bottom', fontsize=14)
plt.show()
17. Dot Plot
A dot plot conveys the rank order of the items.
And since it is aligned along the horizontal axis, you can visualize how far the points are from each other more easily.
# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').apply(lambda x: x.mean())
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)
# Draw plot
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)
ax.hlines(y=df.index, xmin=11, xmax=26, color='gray', alpha=0.7, linewidth=1, linestyles='dashdot')
ax.scatter(y=df.index, x=df.cty, s=75, color='firebrick', alpha=0.7)
# Title, Label, Ticks and Ylim
ax.set_title('Dot Plot for Highway Mileage', fontdict={'size':22})
ax.set_xlabel('Miles Per Gallon')
ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment': 'right'})
ax.set_xlim(10, 27)
plt.show()
18. Slope Chart
A slope chart is most suitable for comparing the 'before' and 'after' positions of a given person/item.
import matplotlib.lines as mlines
# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/gdppercap.csv")
left_label = [str(c) + ', '+ str(round(y)) for c, y in zip(df.continent, df['1952'])]
right_label = [str(c) + ', '+ str(round(y)) for c, y in zip(df.continent, df['1957'])]
klass = ['red' if (y1-y2) < 0 else 'green' for y1, y2 in zip(df['1952'], df['1957'])]
# draw line
# https://stackoverflow.com/questions/36470343/how-to-draw-a-line-with-matplotlib/36479941
def newline(p1, p2, color='black'):
    ax = plt.gca()
    l = mlines.Line2D([p1[0], p2[0]], [p1[1], p2[1]], color='red' if p1[1]-p2[1] > 0 else 'green', marker='o', markersize=6)
    ax.add_line(l)
    return l
fig, ax = plt.subplots(1,1,figsize=(14,14), dpi= 80)
# Vertical Lines
ax.vlines(x=1, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')
ax.vlines(x=3, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')
# Points
ax.scatter(y=df['1952'], x=np.repeat(1, df.shape[0]), s=10, color='black', alpha=0.7)
ax.scatter(y=df['1957'], x=np.repeat(3, df.shape[0]), s=10, color='black', alpha=0.7)
# Line Segments and Annotation
for p1, p2, c in zip(df['1952'], df['1957'], df['continent']):
    newline([1, p1], [3, p2])
    ax.text(1-0.05, p1, c + ', ' + str(round(p1)), horizontalalignment='right', verticalalignment='center', fontdict={'size':14})
    ax.text(3+0.05, p2, c + ', ' + str(round(p2)), horizontalalignment='left', verticalalignment='center', fontdict={'size':14})
# 'Before' and 'After' Annotations
ax.text(1-0.05, 13000, 'BEFORE', horizontalalignment='right', verticalalignment='center', fontdict={'size':18, 'weight':700})
ax.text(3+0.05, 13000, 'AFTER', horizontalalignment='left', verticalalignment='center', fontdict={'size':18, 'weight':700})
# Decoration
ax.set_title("Slopechart: Comparing GDP Per Capita between 1952 vs 1957", fontdict={'size':22})
ax.set(xlim=(0,4), ylim=(0,14000), ylabel='Mean GDP Per Capita')
ax.set_xticks([1,3])
ax.set_xticklabels(["1952", "1957"])
plt.yticks(np.arange(500, 13000, 2000), fontsize=12)
# Lighten borders
plt.gca().spines["top"].set_alpha(.0)
plt.gca().spines["bottom"].set_alpha(.0)
plt.gca().spines["right"].set_alpha(.0)
plt.gca().spines["left"].set_alpha(.0)
plt.show()
19. Dumbbell Plot
A dumbbell plot conveys the 'before' and 'after' positions of various items along with the rank ordering of the items.
It is very useful if you want to visualize the effect of a particular project/initiative on different objects.
import matplotlib.lines as mlines
# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/health.csv")
df.sort_values('pct_2014', inplace=True)
df.reset_index(inplace=True)
# Func to draw line segment
def newline(p1, p2, color='black'):
    ax = plt.gca()
    l = mlines.Line2D([p1[0], p2[0]], [p1[1], p2[1]], color='skyblue')
    ax.add_line(l)
    return l
# Figure and Axes
fig, ax = plt.subplots(1,1,figsize=(14,14), facecolor='#f7f7f7', dpi= 80)
# Vertical Lines
ax.vlines(x=.05, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.10, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.15, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.20, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
# Points
ax.scatter(y=df['index'], x=df['pct_2013'], s=50, color='#0e668b', alpha=0.7)
ax.scatter(y=df['index'], x=df['pct_2014'], s=50, color='#a3c4dc', alpha=0.7)
# Line Segments
for i, p1, p2 in zip(df['index'], df['pct_2013'], df['pct_2014']):
    newline([p1, i], [p2, i])
# Decoration
ax.set_facecolor('#f7f7f7')
ax.set_title("Dumbbell Chart: Pct Change - 2013 vs 2014", fontdict={'size':22})
ax.set(xlim=(0,.25), ylim=(-1, 27), ylabel='Mean GDP Per Capita')
ax.set_xticks([.05, .1, .15, .20])
ax.set_xticklabels(['5%', '10%', '15%', '20%'])
plt.show()
20. Histogram for Continuous Variable
A histogram shows the frequency distribution of a given variable.
The representation below groups the frequency bars based on a categorical variable, giving greater insight into the continuous variable and the categorical variable in tandem.
# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# Prepare data
x_var = 'displ'
groupby_var = 'class'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
vals = [df[x_var].values.tolist() for i, df in df_agg]
# Draw
plt.figure(figsize=(16,9), dpi= 80)
colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
n, bins, patches = plt.hist(vals, 30, stacked=True, density=False, color=colors[:len(vals)])
# Decoration
plt.legend({group:col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0, 25)
plt.xticks(ticks=bins[::3], labels=[round(b,1) for b in bins[::3]])
plt.show()
21. Histogram for Categorical Variable
The histogram of a categorical variable shows the frequency distribution of that variable.
By coloring the bars, you can visualize the distribution in connection with another categorical variable representing the colors.
# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# Prepare data
x_var = 'manufacturer'
groupby_var = 'class'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
vals = [df[x_var].values.tolist() for i, df in df_agg]
# Draw
plt.figure(figsize=(16,9), dpi= 80)
colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
n, bins, patches = plt.hist(vals, df[x_var].unique().__len__(), stacked=True, density=False, color=colors[:len(vals)])
# Decoration
plt.legend({group:col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0, 40)
plt.xticks(ticks=bins, labels=np.unique(df[x_var]).tolist(), rotation=90, horizontalalignment='left')
plt.show()
22. Density Plot
Density plots are a commonly used tool to visualize the distribution of a continuous variable.
By grouping them by the 'response' variable, you can inspect the relationship between X and Y.
The case below is for representational purposes, describing how the distribution of city mileage varies with the number of cylinders.
# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# Draw Plot
plt.figure(figsize=(16,10), dpi= 80)
sns.kdeplot(df.loc[df['cyl'] == 4, "cty"], shade=True, color="g", label="Cyl=4", alpha=.7)
sns.kdeplot(df.loc[df['cyl'] == 5, "cty"], shade=True, color="deeppink", label="Cyl=5", alpha=.7)
sns.kdeplot(df.loc[df['cyl'] == 6, "cty"], shade=True, color="dodgerblue", label="Cyl=6", alpha=.7)
sns.kdeplot(df.loc[df['cyl'] == 8, "cty"], shade=True, color="orange", label="Cyl=8", alpha=.7)
# Decoration
plt.title('Density Plot of City Mileage by n_Cylinders', fontsize=22)
plt.legend()
plt.show()
23. Density Curves with Histogram
A density curve with a histogram brings together the collective information conveyed by the two plots, so you can have them both in a single figure instead of two.
# Import Data
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# Draw Plot
plt.figure(figsize=(13,10), dpi= 80)
sns.distplot(df.loc[df['class'] == 'compact', "cty"], color="dodgerblue", label="Compact", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})
sns.distplot(df.loc[df['class'] == 'suv', "cty"], color="orange", label="SUV", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})
sns.distplot(df.loc[df['class'] == 'minivan', "cty"], color="g", label="minivan", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})
plt.ylim(0, 0.35)
# Decoration
plt.title('Density Plot of City Mileage by Vehicle Type', fontsize=22)
plt.legend()
plt.show()
24. Joy Plot
A joy plot allows the density curves of different groups to overlap; it is a great way to visualize the distribution of a large number of groups in relation to each other.
It looks pleasing to the eye and conveys the right message clearly.
It can be easily built using the joypy package, which is based on matplotlib.
# !pip install joypy
import joypy
# Import Data
mpg = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# Draw Plot
plt.figure(figsize=(16,10), dpi= 80)
fig, axes = joypy.joyplot(mpg, column=['hwy', 'cty'], by="class", ylim='own', figsize=(14,10))
# Decoration
plt.title('Joy Plot of City and Highway Mileage by Class', fontsize=22)
plt.show()
25. Distributed Dot Plot
A distributed dot plot shows the univariate distribution of points segmented by groups.
The darker the points, the greater the concentration of data points in that region.
By coloring the median differently, the real positioning of the groups becomes apparent instantly.
import matplotlib.patches as mpatches
# Prepare Data
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
cyl_colors = {4:'tab:red', 5:'tab:green', 6:'tab:blue', 8:'tab:orange'}
df_raw['cyl_color'] = df_raw.cyl.map(cyl_colors)
# Mean and Median city mileage by make
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').apply(lambda x: x.mean())
df.sort_values('cty', ascending=False, inplace=True)
df.reset_index(inplace=True)
df_median = df_raw[['cty', 'manufacturer']].groupby('manufacturer').apply(lambda x: x.median())
# Draw horizontal lines
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)
ax.hlines(y=df.index, xmin=0, xmax=40, color='gray', alpha=0.5, linewidth=.5, linestyles='dashdot')
# Draw the Dots
for i, make in enumerate(df.manufacturer):
    df_make = df_raw.loc[df_raw.manufacturer==make, :]
    ax.scatter(y=np.repeat(i, df_make.shape[0]), x='cty', data=df_make, s=75, edgecolors='gray', c='w', alpha=0.5)
    ax.scatter(y=i, x='cty', data=df_median.loc[df_median.index==make, :], s=75, c='firebrick')
# Annotate
ax.text(33, 13, r"$red \; dots \; are \; the \; median$", fontdict={'size':12}, color='firebrick')
# Decorations
red_patch = plt.plot([],[], marker="o", ms=10, ls="", mec=None, color='firebrick', label="Median")
plt.legend(handles=red_patch)
ax.set_title('Distribution of City Mileage by Make', fontdict={'size':22})
ax.set_xlabel('Miles Per Gallon (City)', alpha=0.7)
ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment': 'right'}, alpha=0.7)
ax.set_xlim(1, 40)
plt.xticks(alpha=0.7)
plt.gca().spines["top"].set_visible(False)
plt.gca().spines["bottom"].set_visible(False)
plt.gca().spines["right"].set_visible(False)
plt.gca().spines["left"].set_visible(False)
plt.grid(axis='both', alpha=.4, linewidth=.1)
plt.show()
Machine Learning Project Walk-Through in Python: Part One
Reading through a data science book or taking a course, it can feel like you have the individual pieces, but don’t quite know how to put them together.
Taking the next step and solving a complete machine learning problem can be daunting, but persevering through and completing a first project will give you the confidence to tackle any data science problem.
This series of articles will walk through a complete machine learning solution with a real-world dataset to let you see how all the pieces come together.
We’ll follow the general machine learning workflow step-by-step:
Data cleaning and formatting
Exploratory data analysis
Feature engineering and selection
Compare several machine learning models on a performance metric
Perform hyperparameter tuning on the best model
Evaluate the best model on the testing set
Interpret the model results
Draw conclusions and document work
Along the way, we’ll see how each step flows into the next and how to specifically implement each part in Python.
The complete project is available on GitHub, with the first notebook here. This first article will cover steps 1–3 with the rest addressed in subsequent posts.
(As a note, this problem was originally given to me as an “assignment” for a job screen at a start-up.
After completing the work, I was offered the job, but then the CTO of the company quit and they weren’t able to bring on any new employees.
I guess that’s how things go on the start-up scene!)
Problem Definition
The first step before we get coding is to understand the problem we are trying to solve and the available data.
In this project, we will work with publicly available building energy data from New York City.
The objective is to use the energy data to build a model that can predict the Energy Star Score of a building and interpret the results to find the factors that influence the score.
The data includes the Energy Star Score, which makes this a supervised regression machine learning task:
Supervised: we have access to both the features and the target and our goal is to train a model that can learn a mapping between the two
Regression: The Energy Star score is a continuous variable
We want to develop a model that is both accurate — it can predict the Energy Star Score close to the true value — and interpretable — we can understand the model predictions.
Once we know the goal, we can use it to guide our decisions as we dig into the data and build models.
Data Cleaning
Contrary to what most data science courses would have you believe, not every dataset is a perfectly curated group of observations with no missing values or anomalies (looking at you mtcars and iris datasets).
Real-world data is messy, which means we need to clean and wrangle it into an acceptable format before we can even start the analysis.
Data cleaning is an un-glamorous, but necessary part of most actual data science problems.
First, we can load in the data as a Pandas DataFrame and take a look:
import pandas as pd
import numpy as np
# Read in data into a dataframe
data = pd.read_csv('data/Energy_and_Water_Data_Disclosure_for_Local_Law_84_2017__Data_for_Calendar_Year_2016_.csv')
# Display top of dataframe
data.head()
What Actual Data Looks Like!
This is a subset of the full data, which contains 60 columns.
Already, we can see a couple issues: first, we know that we want to predict the ENERGY STAR Score but we don’t know what any of the columns mean.
While this isn’t necessarily an issue — we can often make an accurate model without any knowledge of the variables — we want to focus on interpretability, and it might be important to understand at least some of the columns.
When I originally got the assignment from the start-up, I didn’t want to ask what all the column names meant, so I looked at the name of the file,
and decided to search for “Local Law 84”.
That led me to this page, which explains that this is an NYC law requiring all buildings of a certain size to report their energy use.
More searching brought me to all the definitions of the columns. Maybe looking at a file name is an obvious place to start, but for me this was a reminder to go slow so you don’t miss anything important!
We don’t need to study all of the columns, but we should at least understand the Energy Star Score, which is described as:
A 1-to-100 percentile ranking based on self-reported energy usage for the reporting year.
The Energy Star score is a relative measure used for comparing the energy efficiency of buildings.
That clears up the first problem, but the second issue is that missing values are encoded as “Not Available”.
This is a string in Python, which means that even the columns with numbers will be stored as object datatypes, because Pandas converts a column containing any strings into a column of all strings.
We can see the datatypes of the columns using the dataframe.info() method:
# See the column data types and non-missing values
data.info()
Sure enough, some of the columns that clearly contain numbers (such as ft²), are stored as objects.
We can’t do numerical analysis on strings, so these will have to be converted to number (specifically float) data types!
Here’s a little Python code that replaces all the “Not Available” entries with not-a-number (np.nan), which can be interpreted as numbers, and then converts the relevant columns to the float datatype:
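The full snippet lives in the notebook; here is a minimal, self-contained sketch of the same idea on a toy frame. The column names are illustrative — the real dataset's numeric columns contain unit substrings such as 'ft²', 'kBtu', or 'kWh', which is what the filter below keys on.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the NYC data; the column names are illustrative
data = pd.DataFrame({
    'Site EUI (kBtu/ft²)': ['217.8', 'Not Available', '55.3'],
    'Property Name': ['A', 'B', 'C'],
})

# Replace the "Not Available" sentinel with np.nan throughout the frame
data = data.replace({'Not Available': np.nan})

# Convert columns whose names indicate numeric units to floats
for col in data.columns:
    if 'ft²' in col or 'kBtu' in col or 'kWh' in col:
        data[col] = data[col].astype(float)

print(data.dtypes)
```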
Once the correct columns are numbers, we can start to investigate the data.
Missing Data and Outliers
In addition to incorrect datatypes, another common problem when dealing with real-world data is missing values.
These can arise for many reasons and have to be either filled in or removed before we train a machine learning model.
First, let’s get a sense of how many missing values are in each column (see the notebook for code).
(To create this table, I used a function from this Stack Overflow Forum).
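A hedged sketch of such a missing-values summary function (not the exact Stack Overflow version used in the notebook) might look like:

```python
import numpy as np
import pandas as pd

def missing_values_table(df):
    """Count and percentage of missing values per column, worst first."""
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    table = pd.concat([mis_val, mis_val_percent], axis=1)
    table.columns = ['Missing Values', '% of Total Values']
    # Keep only the columns that actually have missing values
    table = table[table['Missing Values'] > 0]
    return table.sort_values('% of Total Values', ascending=False)

# Small demo frame
demo = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 1], 'c': [1, 2, 3]})
print(missing_values_table(demo))
```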
While we always want to be careful about removing information, if a column has a high percentage of missing values, then it probably will not be useful to our model.
The threshold for removing columns should depend on the problem (here is a discussion), and for this project, we will remove any columns with more than 50% missing values.
At this point, we may also want to remove outliers.
These can be due to typos in data entry, mistakes in units, or they could be legitimate but extreme values.
For this project, we will remove anomalies based on the definition of extreme outliers:
Below the first quartile − 3 ∗ interquartile range
Above the third quartile + 3 ∗ interquartile range
(For the code that removes the columns and the anomalies, see the notebook.)
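On a toy series, the 3 × interquartile-range rule above can be sketched as follows (the project's actual filtering code is in the notebook):

```python
import pandas as pd

# A toy series with one extreme value
s = pd.Series([1, 2, 3, 4, 5, 100])

# Quartiles and the interquartile range
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Extreme-outlier bounds: Q1 - 3*IQR and Q3 + 3*IQR
lower, upper = q1 - 3 * iqr, q3 + 3 * iqr
filtered = s[(s >= lower) & (s <= upper)]
print(filtered.tolist())  # the value 100 falls outside the bounds
```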
At the end of the data cleaning and anomaly removal process, we are left with over 11,000 buildings and 49 features.
Exploratory Data Analysis
Now that the tedious — but necessary — step of data cleaning is complete, we can move on to exploring our data! Exploratory Data Analysis (EDA) is an open-ended process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data.
In short, the goal of EDA is to learn what our data can tell us.
It generally starts out with a high level overview, then narrows in to specific areas as we find interesting parts of the data.
The findings may be interesting in their own right, or they can be used to inform our modeling choices, such as by helping us decide features to use.
Single Variable Plots
The goal is to predict the Energy Star Score (renamed to score in our data) so a reasonable place to start is examining the distribution of this variable.
A histogram is a simple yet effective way to visualize the distribution of a single variable and is easy to make using matplotlib.
import matplotlib.pyplot as plt
# Histogram of the Energy Star Score
plt.style.use('fivethirtyeight')
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k');
plt.xlabel('Score'); plt.ylabel('Number of Buildings');
plt.title('Energy Star Score Distribution');
This looks quite suspicious! The Energy Star score is a percentile rank, which means we would expect to see a uniform distribution, with each score assigned to the same number of buildings.
However, a disproportionate number of buildings have either the highest, 100, or the lowest, 1, score (higher is better for the Energy Star score).
If we go back to the definition of the score, we see that it is based on “self-reported energy usage”, which might explain the very high scores.
Asking building owners to report their own energy usage is like asking students to report their own scores on a test! As a result, this probably is not the most objective measure of a building’s energy efficiency.
If we had an unlimited amount of time, we might want to investigate why so many buildings have very high and very low scores, which we could do by selecting these buildings and seeing what they have in common.
However, our objective is only to predict the score and not to devise a better method of scoring buildings! We can make a note in our report that the scores have a suspect distribution, but our main focus is on predicting the score.
Looking for Relationships
A major part of EDA is searching for relationships between the features and the target.
Variables that are correlated with the target are useful to a model because they can be used to predict the target.
One way to examine the effect of a categorical variable (takes on only a limited set of values) on the target is through a density plot using the seaborn library.
A density plot can be thought of as a smoothed histogram because it shows the distribution of a single variable.
We can color a density plot by class to see how a categorical variable changes the distribution.
The following code makes a density plot of the Energy Star Score colored by the type of building (limited to building types with more than 100 data points):
We can see that the building type has a significant impact on the Energy Star Score.
Office buildings tend to have a higher score while Hotels have a lower score.
This tells us that we should include the building type in our modeling because it does have an impact on the target.
As a categorical variable, we will have to one-hot encode the building type.
A similar plot can be used to show the Energy Star Score by borough:
The borough does not seem to have as large of an impact on the score as the building type.
Nonetheless, we might want to include it in our model because there are slight differences between the boroughs.
To quantify relationships between variables, we can use the Pearson Correlation Coefficient.
This is a measure of the strength and direction of a linear relationship between two variables.
A score of +1 is a perfectly linear positive relationship and a score of -1 is a perfectly negative linear relationship.
Several values of the correlation coefficient are shown below:
Values of the Pearson Correlation Coefficient (Source)
While the correlation coefficient cannot capture non-linear relationships, it is a good way to start figuring out how variables are related.
In Pandas, we can easily calculate the correlations between any columns in a dataframe:
# Find all correlations with the score and sort
correlations_data = data.corr()['score'].sort_values()
The most negative (left) and positive (right) correlations with the target:
There are several strong negative correlations between the features and the target, with the most negative being the different categories of EUI (these measures vary slightly in how they are calculated).
The EUI, or Energy Use Intensity, is the amount of energy used by a building divided by its square footage.
It is meant to be a measure of the efficiency of a building with a lower score being better.
Intuitively, these correlations make sense: as the EUI increases, the Energy Star Score tends to decrease.
Two-Variable Plots
To visualize relationships between two continuous variables, we use scatterplots.
We can include additional information, such as a categorical variable, in the color of the points.
For example, the following plot shows the Energy Star Score vs. Site EUI colored by the building type:
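A minimal sketch of that scatterplot on made-up data (the negative slope and the two building types are assumptions chosen to mirror the relationship described in the text):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Stand-in data: scores fall as Site EUI rises, for two assumed building types
rng = np.random.default_rng(2)
site_eui = rng.normal(100, 25, 400)
df = pd.DataFrame({
    'Site EUI': site_eui,
    'score': np.clip(160 - site_eui + rng.normal(0, 15, 400), 1, 100),
    'type': rng.choice(['Office', 'Hotel'], 400),
})

# One scatter series per building type, so color encodes the category
fig, ax = plt.subplots(figsize=(8, 6))
for b_type, group in df.groupby('type'):
    ax.scatter(group['Site EUI'], group['score'], label=b_type, alpha=0.6)
ax.set_xlabel('Site EUI')
ax.set_ylabel('Energy Star Score')
ax.set_title('Energy Star Score vs. Site EUI')
ax.legend()
```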
This plot lets us visualize what a correlation coefficient of -0.7 looks like.
As the Site EUI decreases, the Energy Star Score increases, a relationship that holds steady across the building types.
The final exploratory plot we will make is known as the Pairs Plot.
This is a great exploration tool because it lets us see relationships between multiple pairs of variables as well as distributions of single variables.
Here we are using the seaborn visualization library and the PairGrid function to create a Pairs Plot with scatterplots on the upper triangle, histograms on the diagonal, and 2D kernel density plots and correlation coefficients on the lower triangle.
To see interactions between variables, we look for where a row intersects with a column.
For example, to see the correlation of Weather Norm EUI with score, we look in the Weather Norm EUI row and the score column and see a correlation coefficient of -0.67.
In addition to looking cool, plots such as these can help us decide variables to include in modeling.
Feature Engineering and Selection
Feature engineering and selection often provide the greatest return on time invested in a machine learning problem.
First of all, let’s define what these two tasks are:
Feature engineering: The process of taking raw data and extracting or creating new features.
This might mean taking transformations of variables, such as a natural log and square root, or one-hot encoding categorical variables so they can be used in a model.
Generally, I think of feature engineering as creating additional features from the raw data.
Feature selection: The process of choosing the most relevant features in the data.
In feature selection, we remove features to help the model generalize better to new data and create a more interpretable model.
Generally, I think of feature selection as subtracting features so we are left with only those that are most important.
A machine learning model can only learn from the data we provide it, so ensuring that data includes all the relevant information for our task is crucial.
If we don’t feed a model the correct data, then we are setting it up to fail and we should not expect it to learn!
For this project, we will take the following feature engineering steps:
One-hot encode categorical variables (borough and property use type)
Add in the natural log transformation of the numerical variables
One-hot encoding is necessary to include categorical variables in a model.
A machine learning algorithm cannot understand a building type of “office”, so we have to record it as a 1 if the building is an office and a 0 otherwise.
Adding transformed features can help our model learn non-linear relationships within the data.
Taking the square root, natural log, or various powers of features is common practice in data science and can be based on domain knowledge or what works best in practice.
Here we will include the natural log of all numerical features.
The following code selects the numeric features, takes log transformations of these features, selects the two categorical features, one-hot encodes these features, and joins the two sets together.
This seems like a lot of work, but it is relatively straightforward in Pandas!
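Those steps might look roughly like the following on a toy dataframe (the two categorical column names come from the text; the rest of the data is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the buildings dataframe
data = pd.DataFrame({
    'Site EUI': [120.0, 85.5, 60.2, 200.1],
    'score': [10, 55, 90, 1],
    'Borough': ['Manhattan', 'Brooklyn', 'Queens', 'Manhattan'],
    'Largest Property Use Type': ['Office', 'Hotel', 'Office',
                                  'Multifamily Housing'],
})

# Select the numeric columns and add the natural log of each feature
numeric_subset = data.select_dtypes('number').copy()
for col in numeric_subset.columns:
    if col == 'score':
        continue  # never transform the target
    numeric_subset['log_' + col] = np.log(numeric_subset[col])

# Select the two categorical columns and one-hot encode them
categorical_subset = pd.get_dummies(
    data[['Borough', 'Largest Property Use Type']])

# Join the two sets back together on their shared index
features = pd.concat([numeric_subset, categorical_subset], axis=1)
print(features.shape)  # → (4, 9)
```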
After this process we have over 11,000 observations (buildings) with 110 columns (features).
Not all of these features are likely to be useful for predicting the Energy Star Score, so now we will turn to feature selection to remove some of the variables.
Feature Selection
Many of the 110 features we have in our data are redundant because they are highly correlated with one another.
For example, here is a plot of Site EUI vs. Weather Normalized Site EUI; these two features have a correlation coefficient of 0.997.
Features that are strongly correlated with each other are known as collinear and removing one of the variables in these pairs of features can often help a machine learning model generalize and be more interpretable.
(I should point out that here we are talking about correlations of features with other features, not correlations with the target, which are what help our model!)
There are a number of methods to calculate collinearity between features, with one of the most common being the variance inflation factor.
In this project, we will use the correlation coefficient to identify and remove collinear features.
We will drop one of a pair of features if the correlation coefficient between them is greater than 0.6.
For the implementation, take a look at the notebook (and this Stack Overflow answer).
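The idea behind that implementation can be sketched as follows (a minimal version on toy data, not the notebook's exact code):

```python
import numpy as np
import pandas as pd

def remove_collinear_features(df, threshold=0.6):
    """Drop one feature from any pair whose absolute correlation
    exceeds the threshold."""
    corr = df.corr().abs()
    # Look only at the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy example: 'b' is almost a copy of 'a', while 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({'a': a,
                   'b': a + rng.normal(scale=0.01, size=200),
                   'c': rng.normal(size=200)})
print(remove_collinear_features(df).columns.tolist())  # → ['a', 'c']
```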
While this value may seem arbitrary, I tried several different thresholds, and this choice yielded the best model.
Machine learning is an empirical field and is often about experimenting and finding what performs best! After feature selection, we are left with 64 total features and 1 target.
# Remove any columns with all na values
features = features.dropna(axis=1, how = 'all')
print(features.shape)
(11319, 65)
Establishing a Baseline
We have now completed data cleaning, exploratory data analysis, and feature engineering.
The final step to take before getting started with modeling is establishing a naive baseline.
This is essentially a guess against which we can compare our results.
If the machine learning models do not beat this guess, then we might have to conclude that machine learning is not suitable for the task, or we might need to try a different approach.
For regression problems, a reasonable naive baseline is to guess the median value of the target on the training set for all the examples in the test set.
This sets a relatively low bar for any model to surpass.
The metric we will use is mean absolute error (MAE), which measures the average absolute error on the predictions.
There are many metrics for regression, but I like Andrew Ng’s advice to pick a single metric and then stick to it when evaluating models.
The mean absolute error is easy to calculate and is interpretable.
Before calculating the baseline, we need to split our data into a training and a testing set:
The training set of features is what we provide to our model during training along with the answers.
The goal is for the model to learn a mapping between the features and the target.
The testing set of features is used to evaluate the trained model.
The model is not allowed to see the answers for the testing set and must make predictions using only the features.
We know the answers for the test set so we can compare the test predictions to the answers.
We will use 70% of the data for training and 30% for testing:
# Split into 70% training and 30% testing set
X, X_test, y, y_test = train_test_split(features, targets,
test_size = 0.3,
random_state = 42)
Now we can calculate the naive baseline performance:
The baseline guess is a score of 66.00
Baseline Performance on the test set: MAE = 24.5164
The naive estimate is off by about 25 points on the test set.
The score ranges from 1–100, so this represents an error of 25%, quite a low bar to surpass!
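On toy data, the whole baseline computation is only a few lines (the numbers produced here are illustrative, not the article's results):

```python
import numpy as np

# Toy targets standing in for the Energy Star Scores
rng = np.random.default_rng(42)
y_train = rng.integers(1, 101, 700).astype(float)
y_test = rng.integers(1, 101, 300).astype(float)

def mae(y_true, y_pred):
    # Mean absolute error between true values and predictions
    return np.mean(np.abs(y_true - y_pred))

# Naive baseline: predict the training median for every test example
baseline_guess = np.median(y_train)
baseline_mae = mae(y_test, baseline_guess)

print(f'The baseline guess is a score of {baseline_guess:.2f}')
print(f'Baseline Performance on the test set: MAE = {baseline_mae:.4f}')
```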
Conclusions
In this article we walked through the first three steps of a machine learning problem.
After defining the question, we:
Cleaned and formatted the raw data
Performed an exploratory data analysis to learn about the dataset
Developed a set of features that we will use for our models
Finally, we also completed the crucial step of establishing a baseline against which we can judge our machine learning algorithms.
The second post (available here) will show how to evaluate machine learning models using Scikit-Learn, select the best model, and perform hyperparameter tuning to optimize the model.
The third post, dealing with model interpretation and reporting results, is here.
As is well known, Python is much slower than languages such as C/C++, so when we need high performance we often turn to Python C/C++ extensions or Cython.
Building a Python C/C++ extension usually means using distutils' Extension class and configuring things like the header include paths (include_dirs) and the C/C++ source paths (sources) in setup.py, as in this example from the official Python documentation:
from distutils.core import setup, Extension

module1 = Extension(
    'demo',
    define_macros=[
        ('MAJOR_VERSION', '1'),
        ('MINOR_VERSION', '0')
    ],
    include_dirs=['/usr/local/include'],
    libraries=['tcl83'],
    library_dirs=['/usr/local/lib'],
    sources=['demo.c']
)

setup(
    name='PackageName',
    version='1.0',
    description='This is a demo package',
    author='Martin v. Loewis',
    author_email='martin@v.loewis.de',
    url='https://docs.python.org/extending/building',
    long_description='''
This is really just a demo package.
''',
    ext_modules=[module1]
)
This approach is sufficient for the vast majority of simple projects. But once you depend on third-party C/C++ libraries, you may for various reasons need to compile the third-party sources together with your own project's sources (abseil-cpp is one example), and then you run into the problem of C/C++ dependency management. CMake is the usual C/C++ dependency-management tool, so this article summarizes and shares approaches for building Python C/C++ extensions with CMake.
Surveying the options
First, how is a CMake project itself normally built? A CMake project typically has a CMakeLists.txt project definition file in its root directory, and the build usually goes like this:
mkdir build
cd build
cmake ..
make
The basic idea, then, is to run the commands above during the Python package build (pip install, python setup.py install, and so on) to produce the extension. A bit of searching turns up two approaches: implement it by hand by subclassing distutils' Extension, or use an existing wrapper library, scikit-build.
Option 1: a distutils CMake Extension
There is a ready-made example of this approach: pybind11's CMake example project (pybind11 itself is, by the way, a project for writing Python C++ extensions). Here is its setup.py:
import os
import re
import sys
import platform
import subprocess

from setuptools import setup, Extension
from setuptools.command.build_ext import build_ext
from distutils.version import LooseVersion


class CMakeExtension(Extension):
    def __init__(self, name, sourcedir=''):
        Extension.__init__(self, name, sources=[])
        self.sourcedir = os.path.abspath(sourcedir)


class CMakeBuild(build_ext):
    def run(self):
        try:
            out = subprocess.check_output(['cmake', '--version'])
        except OSError:
            raise RuntimeError("CMake must be installed to build the following extensions: " +
                               ", ".join(e.name for e in self.extensions))

        if platform.system() == "Windows":
            cmake_version = LooseVersion(re.search(r'version\s*([\d.]+)', out.decode()).group(1))
            if cmake_version < '3.1.0':
                raise RuntimeError("CMake >= 3.1.0 is required on Windows")

        for ext in self.extensions:
            self.build_extension(ext)

    def build_extension(self, ext):
        extdir = os.path.abspath(os.path.dirname(self.get_ext_fullpath(ext.name)))
        # required for auto-detection of auxiliary "native" libs
        if not extdir.endswith(os.path.sep):
            extdir += os.path.sep

        cmake_args = ['-DCMAKE_LIBRARY_OUTPUT_DIRECTORY=' + extdir,
                      '-DPYTHON_EXECUTABLE=' + sys.executable]

        cfg = 'Debug' if self.debug else 'Release'
        build_args = ['--config', cfg]

        if platform.system() == "Windows":
            cmake_args += ['-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_{}={}'.format(cfg.upper(), extdir)]
            if sys.maxsize > 2**32:
                cmake_args += ['-A', 'x64']
            build_args += ['--', '/m']
        else:
            cmake_args += ['-DCMAKE_BUILD_TYPE=' + cfg]
            build_args += ['--', '-j2']

        env = os.environ.copy()
        env['CXXFLAGS'] = '{} -DVERSION_INFO=\\"{}\\"'.format(env.get('CXXFLAGS', ''),
                                                              self.distribution.get_version())
        if not os.path.exists(self.build_temp):
            os.makedirs(self.build_temp)
        subprocess.check_call(['cmake', ext.sourcedir] + cmake_args, cwd=self.build_temp, env=env)
        subprocess.check_call(['cmake', '--build', '.'] + build_args, cwd=self.build_temp)


setup(
    name='cmake_example',
    version='0.0.1',
    author='Dean Moldovan',
    author_email='dean0x7d@gmail.com',
    description='A test project using pybind11 and CMake',
    long_description='',
    ext_modules=[CMakeExtension('cmake_example')],
    cmdclass=dict(build_ext=CMakeBuild),
    zip_safe=False,
)
As you can see, it overrides setuptools' build_ext cmdclass so that the cmake command is invoked during the build to compile the extension.
This approach is a particularly good fit for pybind11 projects, because pybind11 already ships many CMake modules, for example for locating Python.h and libpython. Opening the example project's CMakeLists.txt shows that it uses a CMake function provided by pybind11, pybind11_add_module, to define the Python extension, which avoids a lot of tedious configuration:
cmake_minimum_required(VERSION 2.8.12)
project(cmake_example)
add_subdirectory(pybind11)
pybind11_add_module(cmake_example src/main.cpp)
Without pybind11 things get considerably more tedious; take a look at the CMakeLists.txt of the Apache Arrow Python package to get a feel for it.
Option 2: scikit-build
scikit-build is an improved build-system generator for Python C/C++/Fortran/Cython extensions; in essence it is glue between Python setuptools and CMake.
Let's look at scikit-build's hello-cpp example:
setup.py
import sys

from skbuild import setup

# Require pytest-runner only when running tests
pytest_runner = (['pytest-runner>=2.0,<3dev']
                 if any(arg in sys.argv for arg in ('pytest', 'test'))
                 else [])

setup_requires = pytest_runner

setup(
    name="hello-cpp",
    version="1.2.3",
    description="a minimal example package (cpp version)",
    author='The scikit-build team',
    license="MIT",
    packages=['hello'],
    tests_require=['pytest'],
    setup_requires=setup_requires
)
It is essentially a drop-in replacement for setuptools.setup: instead of from setuptools import setup you write from skbuild import setup.
CMakeLists.txt
cmake_minimum_required(VERSION 3.4.0)
project(hello)
find_package(PythonExtensions REQUIRED)
add_library(_hello MODULE hello/_hello.cxx)
python_extension_module(_hello)
install(TARGETS _hello LIBRARY DESTINATION hello)
Unlike the pybind11 CMake example above, there is no add_subdirectory(pybind11) line; instead it directly uses find_package(PythonExtensions REQUIRED) and the python_extension_module CMake function:
The CMake definitions for PythonExtensions ship inside scikit-build itself.
While skbuild.setup runs, scikit-build automatically loads the CMake definition files it bundles, which is why nothing like the pybind11 add_subdirectory step is needed here.
install(TARGETS _hello LIBRARY DESTINATION hello) copies the built extension's shared library into the hello/ directory, so the hello function from the extension can be imported in Python with from hello._hello import hello.
You would usually also add a pyproject.toml to install the dependencies pip needs at build time:
[build-system]
requires = ["setuptools", "wheel", "scikit-build", "cmake", "ninja"]
Interestingly, scikit-build does not require CMake or Ninja to be installed system-wide: it packages manylinux binary wheels of CMake and Ninja and publishes them to PyPI. Cool.
scikit-build can build extensions that use Cython, pybind11, and more in the same fashion; it is powerful and very convenient.
Postscript
At work I recently finished migrating a Python extension project written in C++ and Cython to CMake and scikit-build, so that it could use abseil-cpp's Swiss Tables to improve performance. This article is more or less a brain dump of that investigation. Next I plan to write about using abseil-cpp's containers from Cython; stay tuned.
Python Exception Handling Using try, except and finally statement
Exceptions in Python
Python has many built-in exceptions that are raised when your program encounters an error.
When these exceptions occur, the Python interpreter stops the current process and passes it to the calling process until it is handled.
If not handled, the program will crash.
For example, let us consider a program where we have a function A that calls function B, which in turn calls function C.
If an exception occurs in function C but is not handled in C, the exception passes to B and then to A.
If never handled, an error message is displayed and our program comes to a sudden unexpected halt.
Catching Exceptions in Python
In Python, exceptions can be handled using a try statement.
The critical operation which can raise an exception is placed inside the try clause.
The code that handles the exceptions is written in the except clause.
We can thus choose what operations to perform once we have caught the exception.
Here is a simple example.
# import module sys to get the type of exception
import sys

randomList = ['a', 0, 2]

for entry in randomList:
    try:
        print("The entry is", entry)
        r = 1/int(entry)
        break
    except:
        print("Oops!", sys.exc_info()[0], "occurred.")
        print("Next entry.")
        print()
print("The reciprocal of", entry, "is", r)
Output
The entry is a
Oops! <class 'ValueError'> occurred.
Next entry.
The entry is 0
Oops! <class 'ZeroDivisionError'> occurred.
Next entry.
The entry is 2
The reciprocal of 2 is 0.5
In this program, we loop through the values of the randomList list.
As previously mentioned, the portion that can cause an exception is placed inside the try block.
If no exception occurs, the except block is skipped and normal flow continues (for the last value).
But if any exception occurs, it is caught by the except block (first and second values).
Here, we print the name of the exception using the exc_info() function inside sys module.
We can see that a causes ValueError and 0 causes ZeroDivisionError.
Since every exception in Python inherits from the base Exception class, we can also perform the above task in the following way:
# import module sys to get the type of exception
import sys

randomList = ['a', 0, 2]

for entry in randomList:
    try:
        print("The entry is", entry)
        r = 1/int(entry)
        break
    except Exception as e:
        print("Oops!", e.__class__, "occurred.")
        print("Next entry.")
        print()
print("The reciprocal of", entry, "is", r)
This program has the same output as the above program.
Catching Specific Exceptions in Python
In the above example, we did not mention any specific exception in the except clause.
This is not a good programming practice as it will catch all exceptions and handle every case in the same way.
We can specify which exceptions an except clause should catch.
A try clause can have any number of except clauses to handle different exceptions; however, only one will be executed if an exception occurs.
We can use a tuple of values to specify multiple exceptions in an except clause.
Here is an example pseudo code.
try:
    # do something
    pass
except ValueError:
    # handle ValueError exception
    pass
except (TypeError, ZeroDivisionError):
    # handle multiple exceptions
    # TypeError and ZeroDivisionError
    pass
except:
    # handle all other exceptions
    pass
Raising Exceptions in Python
In Python programming, exceptions are raised when errors occur at runtime.
We can also manually raise exceptions using the raise keyword.
We can optionally pass values to the exception to clarify why that exception was raised.
>>> raise KeyboardInterrupt
Traceback (most recent call last):
...
KeyboardInterrupt
>>> raise MemoryError("This is an argument")
Traceback (most recent call last):
...
MemoryError: This is an argument
>>> try:
...     a = int(input("Enter a positive integer: "))
...     if a <= 0:
...         raise ValueError("That is not a positive number!")
... except ValueError as ve:
...     print(ve)
...
Enter a positive integer: -2
That is not a positive number!
Python try...finally
The try statement in Python can have an optional finally clause.
This clause is executed no matter what, and is generally used to release external resources.
For example, we may be connected to a remote data center through the network or working with a file or a Graphical User Interface (GUI).
In all these circumstances, we must clean up the resource before the program comes to a halt whether it successfully ran or not.
These actions (closing a file, GUI or disconnecting from network) are performed in the finally clause to guarantee the execution.
Here is an example of file operations to illustrate this.
try:
    f = open("test.txt", encoding='utf-8')
    # perform file operations
finally:
    f.close()
This type of construct makes sure that the file is closed even if an exception occurs during the program execution.
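For file handling specifically, modern Python usually expresses the same guarantee with a with statement, which closes the file automatically when the block exits, even if an exception is raised inside it. A small self-contained sketch:

```python
# Create a small file first so the example runs on its own
with open("test.txt", "w", encoding="utf-8") as f:
    f.write("hello")

# The with statement closes the file on exit from the block,
# even if an exception is raised inside it; no finally clause is needed
with open("test.txt", encoding="utf-8") as f:
    data = f.read()

print(f.closed)  # → True
```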
Sometimes it is useful to display three-dimensional data in two dimensions using contours or color-coded regions.
There are three Matplotlib functions that can be helpful for this task: plt.contour for contour plots, plt.contourf for filled contour plots, and plt.imshow for showing images.
This section looks at several examples of using these.
We'll start by setting up the notebook for plotting and importing the functions we will use:
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
import numpy as np
Visualizing a Three-Dimensional Function
We'll start by demonstrating a contour plot using a function $z = f(x, y)$, using the following particular choice for $f$ (we've seen this before in Computation on Arrays: Broadcasting, when we used it as a motivating example for array broadcasting):
In [2]:
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
A contour plot can be created with the plt.contour function.
It takes three arguments: a grid of x values, a grid of y values, and a grid of z values.
The x and y values represent positions on the plot, and the z values will be represented by the contour levels.
Perhaps the most straightforward way to prepare such data is to use the np.meshgrid function, which builds two-dimensional grids from one-dimensional arrays:
In [3]:
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
Now let's look at this with a standard line-only contour plot:
In [4]:
plt.contour(X, Y, Z, colors='black');
Notice that by default when a single color is used, negative values are represented by dashed lines, and positive values by solid lines.
Alternatively, the lines can be color-coded by specifying a colormap with the cmap argument.
Here, we'll also specify that we want more lines to be drawn—20 equally spaced intervals within the data range:
In [5]:
plt.contour(X, Y, Z, 20, cmap='RdGy');
Here we chose the RdGy (short for Red-Gray) colormap, which is a good choice for centered data.
Matplotlib has a wide range of colormaps available, which you can easily browse in IPython by doing a tab completion on the plt.cm module:
plt.cm.<TAB>
Our plot is looking nicer, but the spaces between the lines may be a bit distracting.
We can change this by switching to a filled contour plot using the plt.contourf() function (notice the f at the end), which uses largely the same syntax as plt.contour().
Additionally, we'll add a plt.colorbar() command, which automatically creates an additional axis with labeled color information for the plot:
In [6]:
plt.contourf(X, Y, Z, 20, cmap='RdGy')
plt.colorbar();
The colorbar makes it clear that the black regions are "peaks," while the red regions are "valleys."
One potential issue with this plot is that it is a bit "splotchy." That is, the color steps are discrete rather than continuous, which is not always what is desired.
This could be remedied by setting the number of contours to a very high number, but this results in a rather inefficient plot: Matplotlib must render a new polygon for each step in the level.
A better way to handle this is to use the plt.imshow() function, which interprets a two-dimensional grid of data as an image.
The following code shows this:
In [7]:
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',
cmap='RdGy')
plt.colorbar()
plt.axis(aspect='image');
There are a few potential gotchas with imshow(), however:
plt.imshow() doesn't accept an x and y grid, so you must manually specify the extent [xmin, xmax, ymin, ymax] of the image on the plot.
plt.imshow() by default follows the standard image array definition where the origin is in the upper left, not in the lower left as in most contour plots.
This must be changed when showing gridded data.
plt.imshow() will automatically adjust the axis aspect ratio to match the input data; this can be changed by setting, for example, plt.axis(aspect='image') to make x and y units match.
Finally, it can sometimes be useful to combine contour plots and image plots.
For example, here we'll use a partially transparent background image (with transparency set via the alpha parameter) and overplot contours with labels on the contours themselves (using the plt.clabel() function):
In [8]:
contours = plt.contour(X, Y, Z, 3, colors='black')
plt.clabel(contours, inline=True, fontsize=8)
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',
cmap='RdGy', alpha=0.5)
plt.colorbar();
The combination of these three functions—plt.contour, plt.contourf, and plt.imshow—gives nearly limitless possibilities for displaying this sort of three-dimensional data within a two-dimensional plot.
For more information on the options available in these functions, refer to their docstrings.
If you are interested in three-dimensional visualizations of this type of data, see Three-dimensional Plotting in Matplotlib.
Density Contours
Example simple contour plot
import numpy as np
from matplotlib.colors import LogNorm
from matplotlib import pyplot as plt
plt.interactive(True)
fig=plt.figure(1)
plt.clf()
# generate input data; you already have that
x1 = np.random.normal(0,10,100000)
y1 = np.random.normal(0,7,100000)/10.
x2 = np.random.normal(-15,7,100000)
y2 = np.random.normal(-10,10,100000)/10.
x=np.concatenate([x1,x2])
y=np.concatenate([y1,y2])
# calculate the 2D density of the data given
# (density=True normalizes the counts; the old normed= argument
# has been removed from NumPy)
counts, xbins, ybins = np.histogram2d(x, y, bins=100, density=True)
# make the contour plot
plt.contour(counts.transpose(),extent=[xbins.min(),xbins.max(),
ybins.min(),ybins.max()],linewidths=3,colors='black',
linestyles='solid')
plt.show()
produces a nice contour plot.
The contour function offers a lot of fancy adjustments, for example let's set the levels by hand:
plt.clf()
mylevels=[1.e-4, 1.e-3, 1.e-2]
plt.contour(counts.transpose(),mylevels,extent=[xbins.min(),xbins.max(),
ybins.min(),ybins.max()],linewidths=3,colors='black',
linestyles='solid')
plt.show()
producing this plot:
And finally, in SM one can do contour plots on linear and log scales, so I spent a little time trying to figure out how to do this in matplotlib.
Here is an example when the y points need to be plotted on the log scale and the x points still on the linear scale:
plt.clf()
# this is our new data which ought to be plotted on the log scale
ynew=10**y
# but the binning needs to be done in linear space
counts, xbins, ybins = np.histogram2d(x, y, bins=100, density=True)
mylevels=[1.e-4,1.e-3,1.e-2]
# and the plotting needs to be done in the data (i.e., exponential) space
plt.contour(xbins[:-1],10**ybins[:-1],counts.transpose(),mylevels,
extent=[xbins.min(),xbins.max(),ybins.min(),ybins.max()],
linewidths=3,colors='black',linestyles='solid')
plt.yscale('log')
plt.show()
This produces a plot which looks very similar to the linear one, but with a nice vertical log axis, which is what was intended:
repeatingtimer
repeatingtimer.py
from threading import Timer as _Timer
# On Python 2 this was "from threading import _Timer"; in Python 3,
# threading.Timer is itself a class and can be subclassed directly.

class Timer(_Timer):
    """
    A repeating timer: calls `function` every `interval` seconds
    until cancel() is called.
    See: https://hg.python.org/cpython/file/2.7/Lib/threading.py#l1079
    """
    def run(self):
        while not self.finished.is_set():
            self.finished.wait(self.interval)
            if not self.finished.is_set():
                self.function(*self.args, **self.kwargs)
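A self-contained usage sketch (the class is re-declared here under a different name, in the Python 3 spelling that subclasses threading.Timer, so the snippet runs on its own):

```python
import threading
import time

class RepeatTimer(threading.Timer):
    """Calls `function` every `interval` seconds until cancel() is called."""
    def run(self):
        while not self.finished.is_set():
            self.finished.wait(self.interval)
            if not self.finished.is_set():
                self.function(*self.args, **self.kwargs)

ticks = []
t = RepeatTimer(0.05, lambda: ticks.append(time.monotonic()))
t.start()
time.sleep(0.3)   # let the timer fire a few times
t.cancel()        # stop repeating
t.join()
print(len(ticks))
```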
Python Data Analysis
A Note About Python Versions
All examples in this cheat sheet use Python 3.
We recommend using the latest stable version of Python, for example, Python 3.8.
You can check which version you have installed on your machine by running the following command in the system shell:
Sometimes, a development machine will have Python 2 and Python 3 installed side by side.
Having two Python versions available is common on macOS.
If that is the case for you, you can use the python3 command to run Python 3 even if Python 2 is the default in your environment:
If you don’t have Python 3 installed yet, visit the Python Downloads page for instructions on installing it.
Launch a Python interpreter by running the python3 command in your shell:
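The commands referenced above, sketched for a typical Unix-like shell (the original page's snippets were not preserved here, so the exact spelling is an assumption):

```shell
python3 --version    # version of the Python 3 interpreter
# python --version   # the default interpreter, which may still be Python 2
# python3            # start an interactive Python 3 session
```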
Libraries and Imports
The easiest way to install Python modules that are needed for data analysis is to use pip.
Installing NumPy and Pandas takes only a few seconds:
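For example (the package names come from the text; the pip invocation is the conventional one):

```shell
python3 -m pip install numpy pandas
```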
Once you’ve installed the modules, use the import statement to make the modules available in your program:
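For instance, using the conventional aliases of the Python data ecosystem:

```python
# Make the installed modules available under their conventional aliases
import numpy as np
import pandas as pd

print(np.__version__, pd.__version__)
```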
Getting Help With Python Data Analysis Functions
If you get stuck, the built-in Python docs are a great place to check for tips and ways to solve the problem.
The Python help() function displays the help article for a method or a class:
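For example (any object or dotted name works as the argument):

```python
# help() displays the documentation for a method or a class
help(len)        # docs for the built-in len function
help(str.join)   # docs for a string method
```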
The help function uses the system text pagination program, also known as the pager, to display the documentation.
Many systems use less as the default text pager. In case you aren’t familiar with its Vi-style shortcuts, here are the basics:
j and k navigate up and down line by line.
/ searches for content in a documentation page.
After pressing / type in the search query, press Enter to go to the first occurrence.
Press n and N to go forward and back through the search results.
Ctrl+d and Ctrl+u move the cursor one page down and one page up, respectively.
Another useful place to check out for help articles is the online documentation for Python data analysis modules like Pandas and NumPy.
For example, the Pandas user guides cover all the Pandas functionality with explanations and examples.
Basic language features
A quick tour through the Python basics:
There are many more useful string methods in Python, find out more about them in the Python string docs.
Working with data sources
Pandas provides a number of easy-to-use data import methods, including CSV and TSV import, copying from the system clipboard, and reading and writing JSON files.
This is sufficient for most Python data analysis tasks:
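A sketch of those import/export methods, round-tripping a tiny frame through CSV and JSON (file names are made up; the files are written to the current directory purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({'name': ['ana', 'bo'], 'score': [88, 95]})

df.to_csv('scores.csv', index=False)     # write CSV
from_csv = pd.read_csv('scores.csv')     # read CSV (for TSV, pass sep='\t')

df.to_json('scores.json')                # write JSON
from_json = pd.read_json('scores.json')  # read JSON

print(from_csv.equals(df))  # → True
```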
Find all other Pandas data import functions in the Pandas docs.
Working with Pandas Data Frames
Pandas data frames are a great way to explore, clean, tweak, and filter your data sets while doing data analysis in Python.
This section covers a few of the things you can do with your Pandas data frames.
Exploring data
Here are a few functions that allow you to easily know more about the data set you are working on:
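For instance (a tiny made-up frame stands in for a real data set loaded with read_csv):

```python
import pandas as pd

df = pd.DataFrame({'city': ['Oslo', 'Lima', 'Pune'],
                   'temp_c': [3.1, 19.4, 27.8]})

print(df.head())      # the first rows of the data set
print(df.shape)       # (rows, columns) → (3, 2)
print(df.dtypes)      # the type of each column
print(df.describe())  # summary statistics for the numeric columns
```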
Statistical operations
All standard statistical operations like minimums, maximums, and custom quantiles are present in Pandas:
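For example, on a small Series (the numbers are made up):

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5, 9, 2, 6])
print(s.min(), s.max())   # → 1 9
print(s.mean())           # → 3.875
print(s.median())         # → 3.5
print(s.quantile(0.25))   # custom quantiles work too
```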
Cleaning the Data
It is quite common to have not-a-number (NaN) values in your data set.
To be able to operate on a data set with statistical methods, you’ll first need to clean up the data.
The fillna and dropna Pandas functions are a convenient way to replace the NaN values with something more representative for your data set, for example, a zero, or to remove the rows with NaN values from the data frame.
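A minimal sketch of both approaches on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 5.0, 6.0]})

filled = df.fillna(0)   # replace NaN with a value representative of the data
dropped = df.dropna()   # or drop every row that contains a NaN

print(filled['a'].tolist())  # → [1.0, 0.0, 3.0]
print(len(dropped))          # → 1 (only the last row is complete)
```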
Filtering and sorting
Here are some basic commands for filtering and sorting the data in your data frames.
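For instance, with toy data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Bob", "Cy"], "score": [88, 95, 70]})
high = df[df["score"] > 80]                         # filter with a boolean mask
ordered = df.sort_values("score", ascending=False)  # sort by a column
print(high["name"].tolist())     # ['Ana', 'Bob']
print(ordered["name"].tolist())  # ['Bob', 'Ana', 'Cy']
```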
Machine Learning
While machine learning algorithms can be incredibly complex, Python’s popular modules make creating a machine learning program straightforward.
Below is an example of a simple ML algorithm that uses Python and its data analysis and machine learning modules, namely NumPy, TensorFlow, Keras, and SciKit-Learn.
In this program, we generate a sample data set with pizza diameters and their respective prices, train the model on this data set, and then use the model to predict the price of a pizza of a diameter that we choose.
Once the model is set up we can use it to predict a result:
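The full program is not reproduced here; the following is a minimal sketch of the same idea using scikit-learn's LinearRegression alone (the diameters and prices are made-up sample values, not the original data set):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: pizza diameters (inches) and prices (dollars)
diameters = np.array([[8], [10], [12], [14], [18]])
prices = np.array([7.0, 9.0, 13.0, 17.5, 18.0])

model = LinearRegression()
model.fit(diameters, prices)

# Predict the price of a pizza with a diameter of our choosing
predicted = model.predict(np.array([[16]]))
print(f"Predicted price: ${predicted[0]:.2f}")
```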
For more details on the functionality available in Pandas, visit the Pandas user guides.
For more powerful math with NumPy (it can be used together with Pandas), check out the NumPy getting started guide.
https://www.youtube.com/c/KGMIT/playlists
Keith Galli
https://www.youtube.com/watch?v=GjKQ6V_ViQE
Comprehensive Python Beautiful Soup Web Scraping Tutorial! (find/find_all, css select, scrape table)
https://github.com/KeithGalli/web-scraping/blob/master/web_scraping_tutorial.ipynb
SAMPLE CODE
https://www.youtube.com/watch?v=zucvHSQsKHA&t=241s
Python Web Scraping - Should I use Selenium, Beautiful Soup or Scrapy? [2020]
https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3
How To Crawl A Web Page with Scrapy and Python 3
import csv
import requests
from bs4 import BeautifulSoup
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib') # If this line causes an error, run 'pip install html5lib' or install html5lib
print(soup.prettify())
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
quotes=[] # a list to store quotes
table = soup.find('div', attrs = {'id':'all_quotes'})
for row in table.findAll('div',
        attrs = {'class':'col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.img['alt'].split(" #")[0]
    quote['author'] = row.img['alt'].split(" #")[1]
    quotes.append(quote)
filename = 'inspirational_quotes.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
SimpleWebSocketServer
simple_http_server
urllib
from simple_websocket_server import WebSocketServer, WebSocket
import simple_http_server
import urllib.request
PORT = 9097
The SimpleWebSocketServer and the simple_http_server listen to the incoming requests, and the urllib module fetches the target web pages.
We can also initialize the port, as shown below.
Get Requests:
We define a function do_GET that will be called for all GET requests.
class MyProxy(simple_http_server.SimpleHTTPRequestHandler):
    def do_GET(self):
        url = self.path[1:]
        self.send_response(200)
        self.end_headers()
        self.copyfile(urllib.request.urlopen(url), self.wfile)
Removing the URL slash
The URL that we pass in the above code will have a slash (/) at the beginning from the browsers.
We can remove the slash using the below code.
url=self.path[1:]
Sending The Headers
We have to send the headers as browsers need them for reporting a successful fetch with the HTTP status code of 200.
self.send_response(200)
self.end_headers()
self.copyfile(urllib.request.urlopen(url), self.wfile)
We used the urllib library in the last line to fetch the URL.
We wrote the URL back to the browser using the copyfile function.
Using The TCP Server:
We will use the ForkingTCPServer mode and pass it to the above class for interrupt handling.
httpd = WebSocketServer.ForkingTCPServer(('', PORT), MyProxy)
httpd.serve_forever()
You can save your file as ProxyServer.py and run it.
Then you can call it from the browser.
Your whole code will look like this.
from simple_websocket_server import WebSocketServer, WebSocket
import simple_http_server
import urllib.request

PORT = 9097

class MyProxy(simple_http_server.SimpleHTTPRequestHandler):
    def do_GET(self):
        url = self.path[1:]
        self.send_response(200)
        self.end_headers()
        self.copyfile(urllib.request.urlopen(url), self.wfile)

httpd = WebSocketServer.ForkingTCPServer(('', PORT), MyProxy)
print("Now serving at " + str(PORT))
httpd.serve_forever()
Whenever we type an address on our browser, our device sends a request to the web host of our destination website.
When the web host receives the request, it sends the web page of our target website back to our device.
The web host only sends the page back to us if it knows our internet protocol, i.e., IP address.
Thus, the target website knows the general location from where we are browsing because we sent out our IP address when we requested to browse the website.
Most likely, the web host may be able to access our ISP (Internet Service Provider) account name with the help of our IP address.
Advantages Of Using An Anonymous Proxy
There are lots of advantages to using an anonymous proxy server.
We must be aware of its benefits to understand how it can help us in our organization or any business.
Following are some of the pros of using anonymous proxy servers:
The most obvious benefit of anonymous proxy servers is that they give us some semblance of privacy.
It essentially substitutes its IP address in place of ours and allows us to bypass geo-blocking.
For instance, a video streaming website provides access to viewers of specific countries and blocks requests from other countries.
We can bypass this restriction by connecting to a proxy server in any country to access the video streaming website.
Public WiFi may prevent us from browsing certain websites at some universities or offices.
We can get around this browsing restriction by using a proxy server.
An anonymous proxy server helps clients protect their vital information from hacking.
A proxy server is often used to access data, speeding up browsing because of its good cache system.
Rotating Proxies:
We can define proxy rotation as a feature that changes our IP address with every new request we send.
When we visit a website, we send a request that shows a destination server a lot of data, including our IP address.
For instance, we send many such requests when we gather data using a scraper (for generating leads).
When most requests come from the same IP, the destination server gets suspicious and bans that IP.
Therefore, there must be a solution to change our IP address with each request we send.
That solution is a rotating proxy.
So, to avoid the needless hassle of getting a scraper for rotating IPs in web scraping, we can get rotating proxies and let our provider take care of the rotation.
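For comparison, here is a minimal sketch of manual rotation with the requests library (the proxy URLs are placeholders, not working servers, and no request is actually sent here):

```python
import itertools
import requests

# Hypothetical proxy pool; in practice the addresses come from your provider
proxy_pool = itertools.cycle([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
])

def fetch(url):
    proxy = next(proxy_pool)  # a different proxy for every request
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
```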
Uses Of Proxies:
Web Scraping
E-commerce websites employ anti-scraping tools for monitoring IP addresses to detect those making multiple web requests.
It is where the use of proxies comes in.
They enable users to make several requests that have ordinarily been detected from different IP addresses.
Each web request is assigned a different IP address.
In this way, the webserver is tricked and thinks that all the web requests come from other devices.
Ad Verification
Ad verification allows advertisers to check if their ads are displayed on the right websites and seen by the right audiences.
Constantly changing IP addresses lets advertisers access many different websites and verify their ads without IP blocks.
Accessing geo-restricted websites and data
The same content can look different or unavailable when accessed from specific locations.
The proxies allow us to access the necessary data regardless of geo-location.
The Best Proxy for Your Online Tasks:
ProxyScrape is one of the most popular and reliable proxy providers online.
Its three proxy services are dedicated datacenter proxy servers, residential proxy servers, and premium proxy servers.
So, what is the best alternative for creating a proxy in Python? Before answering that question, it is best to look at the features of each proxy server.
A dedicated datacenter proxy is best suited for high-speed online tasks, such as streaming large amounts of data (in terms of size) from various servers for analysis purposes.
It is one of the main reasons organizations choose dedicated proxies for transmitting large amounts of data in a short amount of time.
A dedicated datacenter proxy has several features, such as unlimited bandwidth and concurrent connections, dedicated HTTP proxies for easy communication, and IP authentication for more security.
With 99.9% uptime, you can rest assured that the dedicated datacenter will always work during any session.
Last but not least, ProxyScrape provides excellent customer service and will help you to resolve your issue within 24-48 business hours.
Next is a residential proxy.
Residential is a go-to proxy for every general consumer.
The main reason is that the IP address of a residential proxy resembles the IP address provided by ISP.
This means getting permission from the target server to access its data will be easier than usual.
The other feature of ProxyScrape’s residential proxy is a rotating feature.
A rotating proxy helps you avoid a permanent ban on your account because your residential proxy dynamically changes your IP address, making it difficult for the target server to check whether you are using a proxy or not.
Apart from that, the other features of a residential proxy are: unlimited bandwidth with concurrent connections, dedicated HTTP/S proxies, proxies available at any time thanks to a pool of more than 7 million proxies, username and password authentication for more security, and, last but not least, the ability to change the country server.
You can select your desired server by appending the country code to the username authentication.
The last one is the premium proxy. Premium proxies are the same as dedicated datacenter proxies.
The functionality remains the same.
The main difference is accessibility.
In premium proxies, the proxy list (the list that contains proxies) is made available to every user on ProxyScrape’s network.
That is why premium proxies cost less than dedicated datacenter proxies.
So, what is the best alternative for creating a proxy in Python? The answer is a residential proxy or a dedicated datacenter proxy. The reason is simple.
As said above, the residential proxy is a rotating proxy, meaning that your IP address would be dynamically changed over a period of time which can be helpful to trick the server by sending a lot of requests within a small time frame without getting an IP block.
Next, the best thing would be to change the proxy server based on the country.
You just have to append the country ISO_CODE at the end of the IP authentication or username and password authentication.
Datacenter proxy is blazing fast, and if you are an avid movie buff, then a datacenter proxy is the best companion to stream high-quality videos.
https://www.geeksforgeeks.org/creating-a-proxy-webserver-in-python-set-1/
Socket programming in Python is very user-friendly compared to C.
The programmer need not worry about minute details regarding sockets.
In Python, the user can focus on the application layer rather than the network layer.
We would be developing a simple multi-threaded proxy server capable of handling HTTP traffic.
This is a naive implementation of a proxy server.
To begin with, we would achieve the process in 3 easy steps
1. Creating an incoming socket
We create a socket serverSocket in the __init__ method of the Server Class.
This creates a socket for the incoming connections.
We then bind the socket and then wait for the clients to connect.
def __init__(self, config):
    # Shutdown on Ctrl+C
    signal.signal(signal.SIGINT, self.shutdown)
    # Create a TCP socket
    self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Re-use the socket
    self.serverSocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # bind the socket to a public host, and a port
    self.serverSocket.bind((config['HOST_NAME'], config['BIND_PORT']))
    self.serverSocket.listen(10)  # become a server socket
    self.__clients = {}
2. Accept client and process
This is the easiest yet the most important of all the steps.
We wait for the client’s connection request and once a successful connection is made, we dispatch the request in a separate thread, making ourselves available for the next request.
This allows us to handle multiple requests simultaneously which boosts the performance of the server multifold times.
while True:
    # Establish the connection
    (clientSocket, client_address) = self.serverSocket.accept()
    d = threading.Thread(name=self._getClientName(client_address),
                         target=self.proxy_thread, args=(clientSocket, client_address))
    d.daemon = True
    d.start()
3. Redirecting the traffic
The main feature of a proxy server is to act as an intermediate between source and destination.
Here, we would be fetching data from source and then pass it to the client.
First, we extract the URL from the received request data.
# get the request from browser
request = conn.recv(config['MAX_REQUEST_LEN'])
# parse the first line
first_line = request.split('\n')[0]
# get url
url = first_line.split(' ')[1]
Then, we find the destination address of the request.
Address is a tuple of (destination_ip_address, destination_port_no).
We will be receiving data from this address.
http_pos = url.find("://")  # find pos of ://
if http_pos == -1:
    temp = url
else:
    temp = url[(http_pos+3):]  # get the rest of url

port_pos = temp.find(":")  # find the port pos (if any)

# find end of web server
webserver_pos = temp.find("/")
if webserver_pos == -1:
    webserver_pos = len(temp)

webserver = ""
port = -1
if port_pos == -1 or webserver_pos < port_pos:
    # default port
    port = 80
    webserver = temp[:webserver_pos]
else:  # specific port
    port = int((temp[(port_pos+1):])[:webserver_pos-port_pos-1])
    webserver = temp[:port_pos]
Now, we setup a new connection to the destination server (or remote server), and then send a copy of the original request to the server.
The server will then respond with a response.
All the response messages use the generic message format of RFC 822.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(config['CONNECTION_TIMEOUT'])
s.connect((webserver, port))
s.sendall(request)
We then redirect the server’s response to the client.
conn is the original connection to the client.
The response may be bigger than MAX_REQUEST_LEN that we are receiving in one call, so, a null response marks the end of the response.
while True:
    # receive data from web server
    data = s.recv(config['MAX_REQUEST_LEN'])
    if len(data) > 0:
        conn.send(data)  # send to browser/client
    else:
        break
We then close the server connections appropriately and do the error handling to make sure the server works as expected.
How to test the server?
1. Run the server on a terminal.
Keep it running and switch to your favorite browser.
2. Go to your browser’s proxy settings and change the proxy server to ‘localhost’ and port to ‘12345’.
3. Now open any HTTP website (not HTTPS), e.g. geeksforgeeks.org, and voilà! You should be able to access the content in the browser.
Once the server is running, we can monitor the requests coming to the client.
We can use that data to monitor the content that is going or we can develop statistics based on the content.
We can even restrict access to a website or blacklist an IP address.
We would be dealing with more such features in the upcoming tutorials.
What next? We would be adding the following features in our proxy server in the upcoming tutorials.
– Blacklisting Domains – Content monitoring – Logging – HTTP WebServer + ProxyServer
The whole working source code of this tutorial is available here: Creating a Proxy Webserver in Python | Set 2. If you have any questions or comments, feel free to post them in the comments section.
Adding features
A few interesting features are added to make the server more useful.
Add blacklisting of domains.
For example, google.com or facebook.com. Create a list BLACKLIST_DOMAINS in our configuration dict. For now, just ignore/drop the requests received for blacklisted domains. (Ideally, we should respond with a forbidden response.)
# Check if the host:port is blacklisted
for i in range(0, len(config['BLACKLIST_DOMAINS'])):
    if config['BLACKLIST_DOMAINS'][i] in url:
        conn.close()
        return
To add host blocking:
Say, you may need to allow connections from a particular subnet or connection for a particular person. To add this, create a list of all the allowed hosts. Since the hosts can be a subnet as well, add regex for matching the IP addresses, specifically IPV4 addresses. “ IPv4 addresses are canonically represented in dot-decimal notation, which consists of four decimal numbers, each ranging from 0 to 255, separated by dots, e.g., 172.16.254.1. Each part represents a group of 8 bits (octet) of the address.”
Create a method _ishostAllowed in the Server class and use the fnmatch module to match the regexes. Iterate through all the regexes and allow the request if it matches any of them. If a client address is not part of any regex, send a FORBIDDEN response. Again, for now, skip this response creation part.
Note: We will create a full-fledged custom webserver in upcoming tutorials, where a createResponse function will handle generic response creation.
def _ishostAllowed(self, host):
    """ Check if host is allowed to access the content """
    for wildcard in config['HOST_ALLOWED']:
        if fnmatch.fnmatch(host, wildcard):
            return True
    return False
A default host match regex of '*' matches all hosts, though a regex of the form '192.168.*' can also be used. The server currently processes requests but does not print any messages, so we are unaware of its state. Its messages should be logged to the console. For this purpose, use the logging module, as it is thread-safe (the server is multi-threaded, remember).
Import the logging module and set up its initial configuration:
logging.basicConfig(level = logging.DEBUG,
format = '[%(CurrentTime)-10s] (%(ThreadName)-10s) %(message)s',)
Create a separate method that logs every message:
Pass it as an argument, with additional data such as thread-name and current-time to keep track of the logs. Also, create a function that colorizes the logs so that they look pretty on STDOUT.
To achieve this, add a boolean in configuration, COLORED_LOGGING, and create a new function that colorizes every msg passed to it based on the LOG_LEVEL.
def log(self, log_level, client, msg):
    """ Log the messages to appropriate place """
    LoggerDict = {
        'CurrentTime': strftime("%a, %d %b %Y %X", localtime()),
        'ThreadName': threading.currentThread().getName()
    }
    if client == -1:  # Main Thread
        formatedMSG = msg
    else:  # Child threads or Request Threads
        formatedMSG = '{0}:{1} {2}'.format(client[0], client[1], msg)
    logging.debug('%s', utils.colorizeLog(config['COLORED_LOGGING'],
                  log_level, formatedMSG), extra=LoggerDict)
Create a new module, ColorizePython.py
It contains a pycolors class that maintains a list of color codes. Separate this into another module in order to make code modular and to follow PEP8 standards.
# ColorizePython.py
class pycolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'  # End color
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
Module:
import ColorizePython
Method:
def colorizeLog(shouldColorize, log_level, msg):
    # The higher the log_level in the log() argument, the lower its priority.
    colorize_log = {
        "NORMAL": ColorizePython.pycolors.ENDC,
        "WARNING": ColorizePython.pycolors.WARNING,
        "SUCCESS": ColorizePython.pycolors.OKGREEN,
        "FAIL": ColorizePython.pycolors.FAIL,
        "RESET": ColorizePython.pycolors.ENDC
    }
    if shouldColorize.lower() == "true":
        if log_level in colorize_log:
            return colorize_log[str(log_level)] + msg + colorize_log['RESET']
        return colorize_log["NORMAL"] + msg + colorize_log["RESET"]
    return msg
Since the colorizeLog is not a function of a server-class, it is created as a separate module named utils.py which stores all the utility that makes code easier to understand and put this method there. Add appropriate log messages wherever required, especially whenever the state of the server changes.
Modify the shutdown method in the server to exit all the running threads before exiting the application. threading.enumerate() iterates over all the running threads, so we do not need to maintain a list of them. The behavior of the threading module is unexpected when we try to end the main_thread. The official documentation also states this:
“join() raises a RuntimeError if an attempt is made to join the current thread as that would cause a deadlock. It is also an error to join() a thread before it has been started and attempts to do so raises the same exception.”
So, skip it appropriately. Here’s the code for the same.
def shutdown(self, signum, frame):
    """ Handle the exiting server. Clean all traces """
    self.log("WARNING", -1, 'Shutting down gracefully...')
    main_thread = threading.currentThread()
    # Wait for all clients to exit
    for t in threading.enumerate():
        if t is main_thread:
            continue
        self.log("FAIL", -1, 'joining ' + t.getName())
        t.join()
    self.serverSocket.close()
    sys.exit(0)
Build simple proxy server in Python
Build Simple proxy in Python in just 17 lines of code
OpenCV Python Tutorial
import cv2
img = cv2.imread('assets/logo.jpg', 1)
img = cv2.resize(img, (0, 0), fx=0.5, fy=0.5)
img = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)
cv2.imwrite('new_img.jpg', img)
cv2.imshow('Image', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
import cv2
import random
img = cv2.imread('assets/logo.jpg', -1)
# Change first 100 rows to random pixels
for i in range(100):
    for j in range(img.shape[1]):
        img[i][j] = [random.randint(0, 255), random.randint(0, 255), random.randint(0, 255)]
# Copy part of image
tag = img[500:700, 600:900]
img[100:300, 650:950] = tag
cv2.imshow('Image', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
import numpy as np
import cv2
cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    width = int(cap.get(3))
    height = int(cap.get(4))
    image = np.zeros(frame.shape, np.uint8)
    smaller_frame = cv2.resize(frame, (0, 0), fx=0.5, fy=0.5)
    image[:height//2, :width//2] = cv2.rotate(smaller_frame, cv2.ROTATE_180)
    image[height//2:, :width//2] = smaller_frame
    image[:height//2, width//2:] = cv2.rotate(smaller_frame, cv2.ROTATE_180)
    image[height//2:, width//2:] = smaller_frame
    cv2.imshow('frame', image)
    if cv2.waitKey(1) == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
import numpy as np
import cv2
cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    width = int(cap.get(3))
    height = int(cap.get(4))
    img = cv2.line(frame, (0, 0), (width, height), (255, 0, 0), 10)
    img = cv2.line(img, (0, height), (width, 0), (0, 255, 0), 5)
    img = cv2.rectangle(img, (100, 100), (200, 200), (128, 128, 128), 5)
    img = cv2.circle(img, (300, 300), 60, (0, 0, 255), -1)
    font = cv2.FONT_HERSHEY_SIMPLEX
    img = cv2.putText(img, 'Tim is Great!', (10, height - 10), font, 2, (0, 0, 0), 5, cv2.LINE_AA)
    cv2.imshow('frame', img)
    if cv2.waitKey(1) == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
import numpy as np
import cv2
cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    width = int(cap.get(3))
    height = int(cap.get(4))
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    lower_blue = np.array([90, 50, 50])
    upper_blue = np.array([130, 255, 255])
    mask = cv2.inRange(hsv, lower_blue, upper_blue)
    result = cv2.bitwise_and(frame, frame, mask=mask)
    cv2.imshow('frame', result)
    cv2.imshow('mask', mask)
    if cv2.waitKey(1) == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
import numpy as np
import cv2
img = cv2.imread('assets/chessboard.png')
img = cv2.resize(img, (0, 0), fx=0.75, fy=0.75)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
corners = cv2.goodFeaturesToTrack(gray, 100, 0.01, 10)
corners = corners.astype(int)
for corner in corners:
    x, y = corner.ravel()
    cv2.circle(img, (x, y), 5, (255, 0, 0), -1)

for i in range(len(corners)):
    for j in range(i + 1, len(corners)):
        corner1 = tuple(corners[i][0])
        corner2 = tuple(corners[j][0])
        color = tuple(map(lambda x: int(x), np.random.randint(0, 255, size=3)))
        cv2.line(img, corner1, corner2, color, 1)
cv2.imshow('Frame', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
import numpy as np
import cv2
img = cv2.resize(cv2.imread('assets/soccer_practice.jpg', 0), (0, 0), fx=0.8, fy=0.8)
template = cv2.resize(cv2.imread('assets/shoe.PNG', 0), (0, 0), fx=0.8, fy=0.8)
h, w = template.shape
methods = [cv2.TM_CCOEFF, cv2.TM_CCOEFF_NORMED, cv2.TM_CCORR,
cv2.TM_CCORR_NORMED, cv2.TM_SQDIFF, cv2.TM_SQDIFF_NORMED]
for method in methods:
    img2 = img.copy()
    result = cv2.matchTemplate(img2, template, method)
    min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(result)
    if method in [cv2.TM_SQDIFF, cv2.TM_SQDIFF_NORMED]:
        location = min_loc
    else:
        location = max_loc
    bottom_right = (location[0] + w, location[1] + h)
    cv2.rectangle(img2, location, bottom_right, 255, 5)
    cv2.imshow('Match', img2)
    cv2.waitKey(0)
cv2.destroyAllWindows()
import numpy as np
import cv2
cap = cv2.VideoCapture(0)
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_eye.xml')
while True:
    ret, frame = cap.read()
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 5)
        roi_gray = gray[y:y+h, x:x+w]
        roi_color = frame[y:y+h, x:x+w]
        eyes = eye_cascade.detectMultiScale(roi_gray, 1.3, 5)
        for (ex, ey, ew, eh) in eyes:
            cv2.rectangle(roi_color, (ex, ey), (ex + ew, ey + eh), (0, 255, 0), 5)
    cv2.imshow('frame', frame)
    if cv2.waitKey(1) == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
OpenCV-Python Tutorials
Values for OpenCV detectMultiScale() parameters
Python 3: find the circle center from 3 points
from math import sqrt

def findCircle(x1, y1, x2, y2, x3, y3):
    x12 = x1 - x2
    x13 = x1 - x3
    y12 = y1 - y2
    y13 = y1 - y3
    y31 = y3 - y1
    y21 = y2 - y1
    x31 = x3 - x1
    x21 = x2 - x1

    # x1^2 - x3^2
    sx13 = pow(x1, 2) - pow(x3, 2)
    # y1^2 - y3^2
    sy13 = pow(y1, 2) - pow(y3, 2)
    sx21 = pow(x2, 2) - pow(x1, 2)
    sy21 = pow(y2, 2) - pow(y1, 2)

    # use true division (/) so the centre is not truncated to an integer
    f = (sx13 * x12 + sy13 * x12 + sx21 * x13 + sy21 * x13) / (2 * (y31 * x12 - y21 * x13))
    g = (sx13 * y12 + sy13 * y12 + sx21 * y13 + sy21 * y13) / (2 * (x31 * y12 - x21 * y13))
    c = -pow(x1, 2) - pow(y1, 2) - 2 * g * x1 - 2 * f * y1

    # eqn of circle is x^2 + y^2 + 2*g*x + 2*f*y + c = 0,
    # where centre is (h = -g, k = -f) and radius r satisfies r^2 = h^2 + k^2 - c
    h = -g
    k = -f
    sqr_of_r = h * h + k * k - c

    # r is the radius
    r = round(sqrt(sqr_of_r), 5)
    print("Centre = (", h, ", ", k, ")")
    print("Radius = ", r)

# Driver code
if __name__ == "__main__":
    x1, y1 = 1, 1
    x2, y2 = 2, 4
    x3, y3 = 5, 3
    findCircle(x1, y1, x2, y2, x3, y3)
Finding the “center of gravity” of multiple points
where points have unequal weights
import math
import numpy
def toCartesian(t):
    latD, longD = t
    latR = math.radians(latD)
    longR = math.radians(longD)
    return (
        math.cos(latR) * math.cos(longR),
        math.cos(latR) * math.sin(longR),
        math.sin(latR)
    )

def toSpherical(t):
    x, y, z = t
    r = math.hypot(x, y)
    if r == 0:
        if z > 0:
            return (90, 0)
        elif z < 0:
            return (-90, 0)
        else:
            return None
    else:
        return (math.degrees(math.atan2(z, r)), math.degrees(math.atan2(y, x)))
# points is assumed to be a list of dicts like
# {"lat": 40.7, "long": -74.0, "weight": 2.0}
xyz = numpy.asarray([0.0, 0.0, 0.0])
total = 0
for p in points:
    weight = p["weight"]
    total += weight
    xyz += numpy.asarray(toCartesian((p["lat"], p["long"]))) * weight
avgXYZ = xyz / total
avgLat, avgLong = toSpherical(avgXYZ)
print(avgLat, avgLong)
django find center of points
https://stackoverflow.com/questions/6671183/calculate-the-center-point-of-multiple-latitude-longitude-coordinate-pairs
from django.contrib.gis.geos import Point, MultiPoint
points = [
Point((145.137075, -37.639981)),
Point((144.137075, -39.639981)),
]
multipoint = MultiPoint(*points)
point = multipoint.centroid
Simple text-type table extraction
import pdfplumber as pr
import pandas as pd
pdf = pr.open('关于使用自有资金购买银行理财产品的进展公告.PDF')
ps = pdf.pages
pg = ps[3]
tables = pg.extract_tables()
table = tables[0]
print(table)
df = pd.DataFrame(table[1:],columns = table[0])
for i in range(len(table)):
    for j in range(len(table[i])):
        table[i][j] = table[i][j].replace('\n', '')
df1 = pd.DataFrame(table[1:],columns = table[0])
df1.to_excel('page2.xlsx')
Complex table extraction
import pdfplumber as pr
import pandas as pd
pdf = pr.open('关于使用自有资金购买银行理财产品的进展公告.PDF')
ps = pdf.pages
pg = ps[4]
tables = pg.extract_tables()
table = tables[0]
print(table)
df = pd.DataFrame(table[1:],columns = table[0])
for i in range(len(table)):
    for j in range(len(table[i])):
        table[i][j] = table[i][j].replace('\n', '')
df1 = pd.DataFrame(table[1:],columns = table[0])
df2 = df1.iloc[2:,:]
df2 = df2.rename(columns = {"2019年12月31日":"2019年1-12月","2020年9月30日":"2020年1-9月"})
df2 = df2.loc[3:,:]
df1 = df1.loc[:1,:]
with pd.ExcelWriter('公司影响.xlsx') as i:
    df1.to_excel(i, sheet_name='资产', index=False, header=True)  # write the assets data
    df2.to_excel(i, sheet_name='营业', index=False, header=True)  # write the operating data
Image-based table extraction
pip install pytesseract
http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-4.00.00dev.exe
import pytesseract
from PIL import Image
import pandas as pd
pytesseract.pytesseract.tesseract_cmd = 'C://Program Files (x86)/Tesseract-OCR/tesseract.exe'
tiqu = pytesseract.image_to_string(Image.open('图片型.jpg'))
print(tiqu)
tiqu = tiqu.split('\n')
while '' in tiqu:  # a for loop can't be used here (the list changes while iterating)
tiqu.remove('')
first = tiqu[:6]
second = tiqu[6:12]
third = tiqu[12:]
df = pd.DataFrame()
df[first[0]] = first[1:]
df[second[0]] = second[1:]
df[third[0]] = third[1:]
# df.to_excel('图片型表格.xlsx')  # write to an xlsx file
Our approach is to use Tesseract-OCR to parse the image into a string, then apply the split function to turn the string into a list while removing the '\n' characters.
encrypt and decrypt a string in python
USE cryptography.fernet.Fernet
Initialize a cryptographic key by calling cryptography.fernet.Fernet.generate_key().
Configure the encryption type to symmetric encryption by calling the function cryptography.fernet.Fernet(key) with the cryptographic key from step 1 as key.
Encrypt the string by calling cryptography.fernet.Fernet.encrypt(data) with data as the byte representation of a string.
Decrypt an encrypted string by using the key generated from step 1 and the encryption scheme from step 2. Call cryptography.fernet.Fernet.decrypt(token) with the encrypted message as token to get the original message.
from cryptography.fernet import Fernet

key = Fernet.generate_key()
encryption_type = Fernet(key)
encrypted_message = encryption_type.encrypt(b"Hello World")
encode message
print(encrypted_message)
OUTPUT
b'gAAAAABefl-Ur385W0q0YNZM7rbUL_ImiFKBI05hEMIqhgf4FeUKyZFDUzIi3tqnCt6N4mAR2o8-ryPOOyJH32bvZEVjAG-YLg=='
decrypted_message = encryption_type.decrypt(encrypted_message)
Load a file into the python console
From the shell command line:
python file.py
From the Python command line
import file
or
from file import *
print colored text to the terminal
# install the Python termcolor module
from termcolor import colored
in Python 3:
print(colored('hello', 'red'), colored('world', 'green'))
Python supports chained comparisons: several comparison operators can appear in one expression.
It is equivalent to splitting the chain into separate comparisons and joining them with logical AND.
a = 5
print(2 < a < 8)
print(1 == a < 3)
Output:
True
False
3. Repeating a string
To print a string several times you would normally use a loop, but there is a simpler way.
n = 5
string = "Hello!"
print(string * n)
Output:
Hello!Hello!Hello!Hello!Hello!
4. Checking whether a file exists
Python's os module handles interaction with the operating system, including creating, deleting, and inspecting files.
So how do you check whether a file exists? The os module makes it easy.
from os import path

def check_for_file():
    print("Does file exist:", path.exists("data.csv"))

if __name__ == "__main__":
    check_for_file()
Output:
Does file exist: False
A list comprehension is a compact form of a for loop: it builds a new list in a single line and can filter items with an if clause.
def get_vowels(string):
    return [vowel for vowel in string if vowel in 'aeiou']

print("Vowels are:", get_vowels('This is some random string'))
Output:
Vowels are: ['i', 'i', 'o', 'e', 'a', 'o', 'i']
7. Timing code execution
Python's time module provides time-related functions; we can use it to measure how long code takes to run.
import time

start_time = time.time()
total = 0
for i in range(10):
    total += i
print("Sum:", total)
end_time = time.time()
time_taken = end_time - start_time
print("Time: ", time_taken)
Output:
Sum: 45
Time: 0.0009975433349609375
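For interval timing, time.perf_counter() is usually a better fit than time.time(), since it reads the highest-resolution monotonic clock available; a minimal sketch of the same measurement:

```python
import time

# perf_counter() is monotonic and high resolution, so it is well
# suited to measuring elapsed intervals.
start = time.perf_counter()
total = sum(range(10))
elapsed = time.perf_counter() - start

print("Sum:", total)  # Sum: 45
print("Time:", elapsed)
```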
Python handles exceptions with try...except...finally (other combinations of these clauses also exist).
a, b = 1, 0
try:
    print(a/b)
except ZeroDivisionError:
    print("Can not divide by zero")
finally:
    print("Executing finally block")
Output:
Can not divide by zero
Executing finally block
Choosing between multiple functions and calling one in a single line.
def add(a, b):
    return a + b

def subtract(a, b):
    return a - b
a, b = 5, 10
print((add if b > a else subtract)(a,b))
Output:
15
20. Removing duplicates from a list
Duplicates can be removed by iterating and filtering, or directly with a set (which does not preserve order).
list1 = [1,2,3,3,4,'John', 'Ana', 'Mark', 'John']
# Method 1: convert to a set (order is not preserved)
def remove_duplicate(list_value):
    return list(set(list_value))
print(remove_duplicate(list1))
# Method 2: keep first occurrences, preserving order
result = []
[result.append(x) for x in list1 if x not in result]
print(result)
Output:
[1, 2, 3, 4, 'Ana', 'John', 'Mark']
[1, 2, 3, 4, 'John', 'Ana', 'Mark']
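A third option worth knowing: dict.fromkeys() deduplicates while preserving order (dicts keep insertion order since Python 3.7) and avoids the repeated membership scans of method 2:

```python
list1 = [1, 2, 3, 3, 4, 'John', 'Ana', 'Mark', 'John']

# Dict keys are unique and keep insertion order (Python 3.7+),
# so this deduplicates in one pass while preserving order.
result = list(dict.fromkeys(list1))
print(result)  # [1, 2, 3, 4, 'John', 'Ana', 'Mark']
```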
Linear Regression Machine Learning example
""" Linear Regression Machine Learning example:
### Uses data for machine age and time between failures ###
### Predict a model for the data, supervised ML ####
https://www.youtube.com/watch?v=2BusGJyn77E """
## Import packages
import tensorflow as tf  # uses the TF 1.x API (tf.placeholder, tf.Session)
import numpy
import pandas as pd
import matplotlib.pyplot as plt
rng = numpy.random
#Define your spreadsheet
spreadsheet = 'LR_ML.xlsx'
data = pd.read_excel(spreadsheet)
#Define your useful columns of data
months = data['Machine Age (Months)'].values
MTBF = data['Mean Time Between Failure (Days)'].values
# HyperParameters
learning_rate = 0.02
training_epochs = 3000
#Parameter
display_step = 50
# Training Data (X,Y) Sets
train_X = numpy.asarray(months)
train_Y = numpy.asarray(MTBF)
#Specifying the length of the train_x data
n_samples = train_X.shape[0]
# tf Graph Input --- Setting the dtype for the placeholder information
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Set model weights This is initializing the guesses of the model for weight and bias
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
# Construct a linear model (y=WX+b)
pred = tf.add(tf.multiply(X, W), b)
# Mean squared error This is the error in the calculation to try to minimize
error = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
# Gradient descent
# Note, minimize() knows to modify W and b because Variable objects are trainable=True by default
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(error)
# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()
# Start training
with tf.Session() as sess:
    # Run the initializer
    sess.run(init)
    # Fit all training data
    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            sess.run(optimizer, feed_dict={X: x, Y: y})
        # Display logs per epoch step
        if (epoch+1) % display_step == 0:
            c = sess.run(error, feed_dict={X: train_X, Y: train_Y})
            print("Epoch:", '%04d' % (epoch+1), "error=", "{:.9f}".format(c),
                  "W=", sess.run(W), "b=", sess.run(b))
    print("Optimization Finished!")
    training_error = sess.run(error, feed_dict={X: train_X, Y: train_Y})
    print("Training error=", training_error, "W=", sess.run(W), "b=", sess.run(b), '\n')
    # Graphic display
    plt.plot(train_X, train_Y, 'ro', label='Original data')
    plt.plot(train_X, sess.run(W) * train_X + sess.run(b), label='Fitted line')
    plt.legend()
    plt.show()
    # Testing example, as requested (Issue #2)
    test_X = numpy.asarray([2, 4, 6, 8, 10])
    test_Y = numpy.asarray([25, 23, 21, 19, 17])
    print("Testing... (Mean square loss Comparison)")
    testing_error = sess.run(
        tf.reduce_sum(tf.pow(pred - Y, 2)) / (2 * test_X.shape[0]),
        feed_dict={X: test_X, Y: test_Y})  # same function as cost above
    print("Testing error=", testing_error)
    print("Absolute mean square loss difference:", abs(training_error - testing_error))
    plt.plot(test_X, test_Y, 'bo', label='Testing data')
    plt.plot(train_X, sess.run(W) * train_X + sess.run(b), label='Fitted line')
    plt.legend()
    plt.show()
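For comparison, the same straight-line fit can be done without TensorFlow via NumPy's closed-form least squares; a sketch using the made-up test points from the example above:

```python
import numpy as np

# (x, y) points lying on the line y = -x + 27 (the same made-up
# values as the testing example above).
train_X = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
train_Y = np.array([25.0, 23.0, 21.0, 19.0, 17.0])

# A degree-1 polynomial fit returns slope W and intercept b directly,
# with no iterative optimization needed.
W, b = np.polyfit(train_X, train_Y, 1)
print("W=%.3f b=%.3f" % (W, b))  # W=-1.000 b=27.000
```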
python stock market realtime monitoring
Alpha vantage website:
https://www.alphavantage.co/
Full code from the video:
https://github.com/Derrick-Sherrill/DerrickSherrill.com/blob/master/stocks.py
stocks.py
import pandas as pd
from alpha_vantage.timeseries import TimeSeries
import time
api_key = 'RNZPXZ6Q9FEFMEHM'
ts = TimeSeries(key=api_key, output_format='pandas')
data, meta_data = ts.get_intraday(symbol='MSFT', interval = '1min', outputsize = 'full')
print(data)
i = 1
#while i==1:
# data, meta_data = ts.get_intraday(symbol='MSFT', interval = '1min', outputsize = 'full')
# data.to_excel("output.xlsx")
# time.sleep(60)
close_data = data['4. close']
percentage_change = close_data.pct_change()
print(percentage_change)
last_change = percentage_change.iloc[-1]  # positional access to the latest change
if abs(last_change) > 0.0004:
    print("MSFT Alert: " + str(last_change))
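The alert logic can be tried offline with a synthetic close-price series in place of a live Alpha Vantage response (a sketch; the 0.0004 threshold is taken from above, the prices are made up):

```python
import pandas as pd

# Synthetic close prices standing in for data['4. close'].
close_data = pd.Series([100.0, 100.1, 100.0, 100.6])

percentage_change = close_data.pct_change()
last_change = percentage_change.iloc[-1]  # latest percent change (0.6%)

if abs(last_change) > 0.0004:
    print("MSFT Alert: " + str(last_change))
```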
python file server
python -m http.server 8000
ip on hp
192.168.128.93:8000
ip on acer
192.168.128.77:8000
python ftp server
One-line FTP server in Python
Twisted is an event-driven networking engine written in Python.
pip install twisted
code:
from twisted.protocols.ftp import FTPFactory, FTPRealm
from twisted.cred.portal import Portal
from twisted.cred.checkers import AllowAnonymousAccess, FilePasswordDB
from twisted.internet import reactor
reactor.listenTCP(21, FTPFactory(Portal(FTPRealm('./'), [AllowAnonymousAccess()])))
reactor.run()
pyftpdlib
pyftpdlib is one of the very best ftp servers out there for python.
pip3 install pyftpdlib
python -m pyftpdlib
code:
from pyftpdlib import servers
from pyftpdlib.handlers import FTPHandler
address = ("0.0.0.0", 21) # listen on every IP on my machine on port 21
server = servers.FTPServer(address, FTPHandler)
server.serve_forever()
To get a list of command line options:
python3 -m pyftpdlib --help
To set it up on port 21 with write access:
python -m pyftpdlib -p 21 -w
Usage: python -m pyftpdlib [options]
Start a stand-alone anonymous FTP server.
Options:
-h, --help. show this help message and exit
-i ADDRESS, --interface=ADDRESS. specify the interface to run on (default all interfaces)
-p PORT, --port=PORT. specify port number to run on (default 2121)
-w, --write. grants write access for logged in user (default read-only)
-d FOLDER, --directory=FOLDER. specify the directory to share (default current directory)
-n ADDRESS, --nat-address=ADDRESS. the NAT address to use for passive connections
-r FROM-TO, --range=FROM-TO. the range of TCP ports to use for passive connections (e.g. -r 8000-9000)
-D, --debug. enable DEBUG logging level
-v, --version. print pyftpdlib version and exit
-V, --verbose. activate a more verbose logging
-u USERNAME, --username=USERNAME. specify username to login with (anonymous login will be disabled and password required if supplied)
-P PASSWORD, --password=PASSWORD. specify a password to login with (username required to be useful)
enable FTP through Chrome on all Windows devices
In Chrome 81, FTP support is disabled by default, but you can enable it using the # enable-ftp flag.
Open Chrome and type “chrome://flags” in the address bar.
Once in the flags area, type “enable-ftp” in the search bar labelled “Search flags”.
When you see the “Enable support for FTP URLs” option tap where it says “Default”.
Tap “Enable” option.
Hit “Relaunch Now” option at the bottom of the page.
FTP using Chrome
You can download content via ftp://username:password@your-domain.com.
But at the moment Chrome does not support uploading of content via FTP.
To upload your files you may want to use FileZilla or CuteFTP.
Some web browsers, such as Microsoft Internet Explorer, can also be used for FTP purposes and konsoleH includes the File Manager, which allows you to transfer files to and from your upload area.
create a simple message box in Python
import ctypes # An included library with Python install.
ctypes.windll.user32.MessageBoxW(0, "Your text", "Your title", 1)
Or define a function (Mbox) like so:
import ctypes # An included library with Python install.
def Mbox(title, text, style):
return ctypes.windll.user32.MessageBoxW(0, text, title, style)
Mbox('Your title', 'Your text', 1)
Note the styles are as follows:
## Styles:
## 0 : OK
## 1 : OK | Cancel
## 2 : Abort | Retry | Ignore
## 3 : Yes | No | Cancel
## 4 : Yes | No
## 5 : Retry | Cancel
## 6 : Cancel | Try Again | Continue
Note: edited to use MessageBoxW instead of MessageBoxA
Python For Bluetooth
https://ukbaz.github.io/en/html/reference/bluetooth_overview/index.html
Back in 2015 I became aware of Bluetooth BLE Beacons and some of the things that could be done with them.
At the same time I was helping on a STEM initiative called Go4SET where I would help students build out ideas of how to solve problems they had observed in the world around them.
Their solution would show how electronics and software could be used to solve the problems.
As Python was the language of choice in the schools I was working with, I started to investigate how to scan for BLE Beacons using a Raspberry Pi.
Here we are in 2020 and I still don’t have a great solution for how to do this, but things have got better in that time and I’ve learnt some things along the way.
One of the key things I’ve learnt is that there is a lot of out-of-date information on the internet about Bluetooth.
I suspect my writings will, in time, add to that volume of out-of-date information.
For now I am aiming for them to be of some help to someone coming to the topic anew.
So here is some Python-Linux-Bluetooth information that might help someone starting.
Bad Information
Many tutorials on the internet are done with command-line tools that have been deprecated, such as hcitool and hcidump.
If you see tutorials using the HCI (Host Controller Interface) socket then it is either out-of-date or at such a low level that it is best to stay away.
The command-line tools recommended by the BlueZ developers are bluetoothctl or, if you need more control, btmgmt.
And instead of hcidump, use btmon.
I would also be very nervous about using a library that uses HCI sockets for interfacing with the Bluetooth hardware on Linux.
More on the different programming interfaces later.
But BlueZ…Really?
During the years I’ve been playing around with Bluetooth on Linux I’ve seen people show their frustration with the way that BlueZ handles things.
And I see people’s point.
An example is that the HCI tools were deprecated and removed.
It is hard to find tutorials on how to use the new tools and answers to questions on the mailing list expect a certain level of knowledge.
It is also common for questions to go unanswered on the mailing list.
This is Open Source so they don’t owe anyone an answer.
However, I have also seen the developers show their frustration that people go off and do crazy things rather than how they had intended things to work.
I spent many years of my professional life as an Application Engineer for a software company.
My big learning from that time is that if you don’t show people how to use your tool (and make using it the way you intended the easiest),
then smart people will work out their own way of doing it.
Having said all of that, the developers have settled on the DBus API and it is getting better and better.
The biggest barrier for most people is finding the “on-ramp” to learning about how to use it.
There are Python examples in the repository, but frankly they are often of limited value.
BlueZ API
A list of the possible API’s starting from lowest level and going to the highest.
For most people, the higher the better.
HCI Socket
As I said earlier, this bypasses the bluetoothd daemon running on the Linux system, which the desktop tools rely on.
Using this is not a great idea unless you really, really know what you are doing.
All the information is available in the Bluetooth Core Specification
which runs to about 3,256 pages for the 5.2 version of the spec.
MGMT Socket
The BlueZ Bluetooth Management API is the next step up, and the lowest level that the BlueZ developers recommend.
The problem for Python users is this bug
makes it difficult to access the mgmt socket.
There are other duplicate bugs on this in the system.
Until they are fixed, this remains off limits for many Python users.
DBus API
This should be the go-to level for most people wanting to interact with the BlueZ APIs.
However, it seems the number of people that have done things with DBus previously is a relatively small group and it is another level of indirection to learn.
There are a number of libraries that offer DBus bindings for Python.
However, there isn’t just one library that is correct for all cases.
pydbus is one of the easier ones to get started with.
The BlueZ DBus API for interacting with the Bluetooth Adapter on your Raspberry Pi is documented at
https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/doc/adapter-api.txt
This tells you that the DBus Service name is org.bluez.
The Object Path is less obvious from the documentation but is /org/bluez/hci0 by default on most Linux machines.
With this information we can quickly look to see properties from the adapter using Python.
The example below looks at the adapter's name, whether it is powered, and its MAC address:
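A minimal sketch of that lookup, assuming pydbus is installed, bluetoothd is running, and the default adapter is hci0 (the pydbus import is kept inside the function so the module loads even without it):

```python
SERVICE = 'org.bluez'  # the BlueZ DBus service name

def adapter_path(index=0):
    """Build the default BlueZ adapter object path, e.g. /org/bluez/hci0."""
    return '/org/bluez/hci{}'.format(index)

def show_adapter(index=0):
    """Print a few properties of the Bluetooth adapter via the DBus API."""
    import pydbus  # third-party; needs a Linux system bus with bluetoothd
    adapter = pydbus.SystemBus().get(SERVICE, adapter_path(index))
    print('Name:   ', adapter.Name)     # the adapter's friendly name
    print('Powered:', adapter.Powered)  # True / False
    print('Address:', adapter.Address)  # the adapter's MAC address
```

Calling show_adapter() on a Raspberry Pi with Bluetooth enabled prints the three properties; the property names come from the adapter-api.txt documentation linked above.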
Python For Bluetooth
If you write applications on iOS or Android, then you will have seen there are some great libraries with APIs that hide much of the gnarliness of Bluetooth.
With Python there are not those libraries around with that level of abstraction for most things you might want to do.
So you might end up going a little deeper and needing to know some of the details of Bluetooth.
Libraries to help you with Bluetooth
There are plenty of them out there.
I keep a list of many of them at:
https://github.com/ukBaz/python-bluezero/wiki
Most of them are pretty niche in what they do.
There are a number of them that are abandonware.
This isn’t surprising given how big Bluetooth is and the many things you can do with it.
It is also really hard to automate the testing of Python Bluetooth libraries and I think this is what ends up being the main reason why the libraries stay niche or abandoned.
More than one Bluetooth
Depending on where you are starting from there can be a number of details that can trip people up when they first engage with Bluetooth and code.
The first is that there are two different types of Bluetooth.
These are generally referred to as Classic and BLE.
Devices like the Raspberry Pi support both, while the BBC micro:bit is BLE only.
If you try to use Classic (aka BR/EDR, aka rfcomm,
aka Serial port profile, aka spp, aka 1101,
aka 00001101-0000-1000-8000-00805f9b34fb) on the Raspberry Pi then it will never speak sensibly with a micro:bit.
Bluetooth Classic (BR/EDR) supports speeds up to about 24Mbps.
It was version 4.0 of the standard that introduced a low energy mode,
Bluetooth Low Energy (BLE or LE, also known as “Bluetooth Smart”),
that operates at 1Mbps.
This mode allows devices to leave their transmitters off most of the time.
As a result it is “Low Energy”.
These two modes have a different philosophy of how they behave.
Classic is a cable replacement.
It makes the connection and stays connected.
BLE is similar to a database where the transmitter is only on when it is being written to or read from.
Clients can also subscribe to notifications when data changes in the Generic ATTribute Profile (GATT).
In classic mode there is a server and a client.
The server advertises and the client connects.
With BLE there are different terms of peripheral and central.
A peripheral advertises and a central scans and connects.
In BLE you can also have a Broadcaster (beacon) which is a transmitter only
(connectionless) application.
The Observer (scanner) role is for receiver only connectionless applications.
Endianness
As with most communication protocols, data is chopped up into bytes that are sent between the two devices.
When this is done there is a choice of what order those bytes are transmitted in.
This is referred to as endianness.
The Bluetooth standard is little-endian, which often trips up people looking at Bluetooth for the first time.
The exception to this is when looking at beacons.
As far as I can tell this seems to be because Apple did this when they brought out the iBeacon and many have followed that example.
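A quick sketch with the struct module makes the byte-order difference concrete:

```python
import struct

value = 0x1101  # the Serial Port Profile's 16-bit identifier

# '<H' packs a 16-bit value little-endian (low byte first),
# '>H' packs it big-endian (high byte first).
little = struct.pack('<H', value).hex()  # order used on the Bluetooth wire
big = struct.pack('>H', value).hex()     # order used in iBeacon-style payloads

print(little)  # 0111
print(big)     # 1101
```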
Binary
Because Bluetooth has come out of the embedded world there are lots of binary numbers referring to things rather than nice string names.
Lots of values are 128 bits in length.
This means that when I want to look at the status of button A on a micro:bit I need to look in the GATT database for E95DDA90-251D-470A-A062-FA1922DFA9A8
In classic mode, the Serial Port Profile (SPP) is normally referred to by the 16-bit hex value 0x1101.
However, it is really a 128-bit value; because it is an official profile it can be shortened to a 16-bit value.
Bluetooth Special Interest Group (SIG) Reserved Values
The SIG has the following base UUID reserved, where the xxxx below is replaced with the 16-bit value.
0000xxxx-0000-1000-8000-00805f9b34fb
If you see a tutorial using 16-bit values that are not official SIG profiles, be suspicious of whether it is a good tutorial.
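The expansion rule above can be sketched as a one-line helper:

```python
def full_uuid(short):
    """Expand a 16-bit SIG-assigned value into the full 128-bit UUID
    using the Bluetooth base UUID pattern quoted above."""
    return '0000{:04x}-0000-1000-8000-00805f9b34fb'.format(short)

print(full_uuid(0x1101))  # 00001101-0000-1000-8000-00805f9b34fb (SPP)
```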
Asynchronous
There are parts of Bluetooth that just need to be asynchronous.
Examples are when scanning for new devices or getting notifications from a peripheral.
While this is possible to do with Python, asynchronous isn’t the way most people learn Python.
For BlueZ, it works with the GLib event loop, which will be familiar to people that have coded GUIs in Python.
Pairing and Connecting
I have seen confusion between these two terms when people come to programming Bluetooth.
Pairing is about the two devices exchanging information so that the devices can communicate securely.
So pairing is a one-off activity to exchange credentials.
It is not always required, as sometimes it is OK for devices to exchange information without being secure, especially if you are just learning, since skipping it simplifies the process.
Connection needs to be done every time you want the devices to start communicating.
It is a straightforward step if the two devices already know about each other.
I typically recommend that the one-off setup of scanning and pairing is done manually with bluetoothctl.
RFCOMM (Or is that SPP?)
This is the most useful profile in classic mode for many activities in the maker community, when you want to exchange information between two boards that support a Bluetooth serial connection.
From Python 3.3 this is supported within the standard socket library.
Below is an example of a client connecting to a server.
This assumes the pairing has already happened and will do the connection.
>>> import socket
>>> s = socket.socket(socket.AF_BLUETOOTH, socket.SOCK_STREAM, socket.BTPROTO_RFCOMM)
>>> s.connect(('B8:27:EB:22:57:E0', 1))
>>> s.send(b'Hello')
>>> s.recv(1024)
b'world'
>>> s.close()
If this just works then life is great.
If there are issues, then this is when Bluetooth can become more frustrating.
Debugging is probably a separate post.
BLE (Or is that GATT?)
With BLE there is not the same level of support in native Python, so you need to use the DBus API.
This means using the Device and GATT interfaces.
The difficult piece is that the DBus Object Path for the devices, GATT Services, and GATT Characteristics we are interested in is not known ahead of the connection.
This results in the need to do a reverse look-up from the UUID to the object path.
This was the subject of a kata I held at my local Python user group.
Good To Know
This talk at Embedded Linux Conference gave lots of good insight in to how things are done with BlueZ.
It is worth a watch if you are interested in learning more.
Python, Bluetooth, and Windows…
In Python 3.9 it is going to be easier to use Bluetooth RFCOMM (Serial Port Profile) thanks to this submission: https://bugs.python.org/issue36590
The example findmyphone.py demonstrates using a small Python program to find a nearby Bluetooth device named My Phone. The example is shown below; just change target_name to the name of the Bluetooth device you are looking for.
import bluetooth
target_name = "My Phone"
target_address = None
nearby_devices = bluetooth.discover_devices()
for bdaddr in nearby_devices:
    if target_name == bluetooth.lookup_name(bdaddr):
        target_address = bdaddr
        break

if target_address is not None:
    print("found target bluetooth device with address", target_address)
else:
    print("could not find target bluetooth device nearby")
A Bluetooth address has the form xx:xx:xx:xx:xx:xx, where each xx is a hexadecimal byte; every Bluetooth device has a unique address (see the earlier post for how to look one up). If we instead want to find a device by a given name rather than by address, there are two steps:
Taking findmyphone.py above as an example, the program first scans for nearby Bluetooth devices: discover_devices() searches for roughly 10 seconds and then returns a list of addresses.
Next, lookup_name() connects to each detected device and requests its device name, checking whether it matches the target name My Phone; if it does, the program reports the find and prints the Bluetooth address.
Scanning a region for devices and looking up names can occasionally fail (interference, lots of devices, devices on the move), and a lookup can return None, in which case the name cannot be matched; the best fix is simply to retry a few times.
https://people.csail.mit.edu/albert/bluez-intro/c212.html
Ciphey
Installation
python3 -m pip install ciphey --upgrade
Windows Python defaults to install 32-bit.
Ciphey only supports 64-bit.
Make sure you're using 64-bit Python.
There are 3 ways to run Ciphey.
File Input ciphey -f encrypted.txt
Unqualified input ciphey -- "Encrypted input"
Normal way ciphey -t "Encrypted input"
To get rid of the progress bars, probability table, and all the noise use the quiet mode.
ciphey -t "encrypted text here" -q
For a full list of arguments, run ciphey --help.
Importing Ciphey
You can import Ciphey's main and use it in your own programs and code.
from Ciphey.__main__ import main
The information we need sits in the a tag under the div tag whose class is board-item-main, so we extract its text.
The core code is as follows:
movie_name = doc('.board-item-main .board-item-content .movie-item-info p a').text()
Getting the cast
As the screenshot shows, the cast information sits in a child p tag of board-item-main, so we can fetch it like this.
The core code is as follows:
p = doc('.board-item-main .board-item-content .movie-item-info')
star = p.children('.star').text()
Getting the release date
As seen in the earlier screenshot, the release-date node is a sibling of the cast node, so the code looks like this.
p = doc('.board-item-main .board-item-content .movie-item-info')
time = p.children('.releasetime').text()
Getting the score
Fetching each film's score is a little more involved. Why? Look at the page:
the integer part and the fractional part of the score are split into two separate nodes, so we fetch the two parts separately and then concatenate them.
The core code is as follows:
score1 = doc('.board-item-main .movie-item-number.score-num .integer').text().split()
score2 = doc('.board-item-main .movie-item-number.score-num .fraction').text().split()
score = [score1[i]+score2[i] for i in range(0, len(score1))]
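The splitting-and-joining step can be sketched on hypothetical stand-in strings (mimicking what the two .text().split() calls would return):

```python
# Stand-ins for the text of the .integer and .fraction nodes (made up).
score1 = '9. 8. 9.'.split()  # integer parts, including the decimal point
score2 = '6 3 0'.split()     # fractional digits

# Pair each integer part with its fractional digit.
score = [score1[i] + score2[i] for i in range(0, len(score1))]
print(score)  # ['9.6', '8.3', '9.0']
```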
Importing EasyGui
In order to use EasyGui, you must import it. The simplest import statement is:
import easygui
If you use this form, then to access the EasyGui functions, you must prefix them with the name “easygui”, this way:
easygui.msgbox(...)
One alternative is to import EasyGui this way:
from easygui import *
This makes it easier to invoke the EasyGui functions; you won’t have to prefix the function names with “easygui”. You can just code something like this:
msgbox(...)
A third alternative is to use something like the following import statement:
import easygui as g
This allows you to keep the EasyGui namespace separate with a minimal amount of typing. You can access easygui functions like this:
g.msgbox(...)
This third alternative is actually the best way to do it once you get used to Python and EasyGui.
Using EasyGui
Once your module has imported EasyGui, GUI operations are a simple matter of invoking EasyGui functions with a few parameters. For example, using EasyGui, the famous “Hello, world!” program looks like this:
from easygui import *
msgbox("Hello, world!")
To see a demo of what EasyGui output looks like, invoke easyGui from the command line, this way:
python easygui.py
To see examples of code that invokes the EasyGui functions, look at the demonstration code at the end of easygui.py.
Default arguments for EasyGui functions
For all of the boxes, the first two arguments are for message and title, in that order. In some cases, this might not be the most user-friendly arrangement (for example, the dialogs for getting directory and filenames ignore the message argument), but I felt that keeping this consistent across all widgets was the more important consideration.
Most arguments to EasyGui functions have defaults.
Almost all of the boxes display a message and a title. The title defaults to the empty string, and the message usually has a simple default.
This makes it possible to specify as few arguments as you need in order to get the result that you want. For instance, the title argument to msgbox is optional, so you can call msgbox specifying only a message, this way:
msgbox("Danger, Will Robinson!")
or specifying a message and a title, this way:
msgbox("Danger, Will Robinson!", "Warning!")
On the various types of buttonbox, the default message is “Shall I continue?”, so you can (if you wish) invoke them without arguments at all. Here we invoke ccbox (the close/cancel box, which returns a boolean value) without any arguments at all:
if ccbox():
    pass  # user chose to continue
else:
    return  # user chose to cancel
Using keyword arguments when calling EasyGui functions
It is possible to use keyword arguments when calling EasyGui functions.
Suppose for instance that you wanted to use a buttonbox, but
(for whatever reason) did not want to specify the title (second) positional argument. You could still specify the choices argument (the third argument)
using a keyword, this way:
choices = ["Yes","No","Only on Friday"]
reply = choicebox("Do you like to eat fish?", choices=choices)
Using buttonboxes
There are a number of functions built on top of buttonbox() for common needs.
msgbox
msgbox displays a message and offers an OK button. You can send whatever message you want, along with whatever title you want. You can even over-ride the default text of “OK” on the button if you wish. Here is the signature of the msgbox function:
def msgbox(msg="(Your message goes here)", title=" ", ok_button="OK"):
....
The clearest way to over-ride the button text is to do it with a keyword argument, like this:
msgbox("Backup complete!", ok_button="Good job!")
Here are a couple of examples:
msgbox("Hello, world!")
msg = "Do you want to continue?"
title = "Please Confirm"
if ccbox(msg, title):  # show a Continue/Cancel dialog
    pass  # user chose Continue
else:  # user chose Cancel
    sys.exit(0)
ccbox
ccbox offers a choice of Continue and Cancel, and returns either True (for continue) or False (for cancel).
ynbox
ynbox offers a choice of Yes and No, and returns either True or False.
buttonbox
To specify your own set of buttons in a buttonbox, use the buttonbox() function.
The buttonbox can be used to display a set of buttons of your choice. When the user clicks on a button, buttonbox() returns the text of the choice. If the user cancels or closes the buttonbox, the default choice (the first choice) is returned.
buttonbox displays a message, a title, and a set of buttons. Returns the text of the button that the user selected.
indexbox
indexbox displays a message, a title, and a set of buttons. Returns the index of the user’s choice. For example, if you invoked indexbox with three choices (A, B, C), indexbox would return 0 if the user picked A, 1 if he picked B, and 2 if he picked C.
boolbox
boolbox (boolean box) displays a message, a title, and a set of buttons. Returns 1 if the first button is chosen, otherwise 0.
Here is a simple example of a boolbox():
message = "What does she say?"
title = ""
if boolbox(message, title, ["She loves me", "She loves me not"]):
    sendher("Flowers")  # This is just a sample function that you might write.
else:
    pass
How to show an image in a buttonbox
When you invoke the buttonbox function (or other functions that display a button box, such as msgbox, indexbox, ynbox,
etc.), you can specify the keyword argument image=xxx where xxx is the filename of an image. The file can be .gif.
Usually, you can use other image formats such as .png.
Note
The types of files supported depend on how you installed Python. If other formats don’t work, you may need to install the PIL library.
If an image argument is specified, the image file will be displayed after the message.
Here is some sample code from EasyGui’s demonstration routine:
image = "python_and_check_logo.gif"
msg = "Do you like this picture?"
choices = ["Yes","No","No opinion"]
reply = buttonbox(msg, image=image, choices=choices)
If you click on one of the buttons on the bottom, its value will be returned in ‘reply’. You may also click on the image.
In that case, the image filename is returned.
Letting the user select from a list of choices
choicebox
Buttonboxes are good for offering the user a small selection of short choices. But if there are many choices, or the text of the choices is long, then a better strategy is to present them as a list.
choicebox provides a way for a user to select from a list of choices. The choices are specified in a sequence (a tuple or a list). The choices will be given a case-insensitive sort before they are presented.
The keyboard can be used to select an element of the list.
Pressing “g” on the keyboard, for example, will jump the selection to the first element beginning with “g”. Pressing “g” again, will jump the cursor to the next element beginning with “g”. At the end of the elements beginning with “g”, pressing “g” again will cause the selection to wrap around to the beginning of the list and jump to the first element beginning with “g”.
If there is no element beginning with “g”, then the last element that occurs before the position where “g” would occur is selected. If there is no element before “g”, then the first element in the list is selected:
msg ="What is your favorite flavor?"
title = "Ice Cream Survey"
choices = ["Vanilla", "Chocolate", "Strawberry", "Rocky Road"]
choice = choicebox(msg, title, choices)
Another example of a choicebox:
multchoicebox
The multchoicebox() function provides a way for a user to select from a list of choices. The interface looks just like the choicebox, but the user may select zero, one, or multiple choices.
The choices are specified in a sequence (a tuple or a list). The choices will be given a case-insensitive sort before they are presented.
Letting the user enter information
enterbox
enterbox is a simple way of getting a string from the user.
integerbox
integerbox is a simple way of getting an integer from the user.
multenterbox
multenterbox is a simple way of showing multiple enterboxes on a single screen.
In the multenterbox:
If there are fewer values than names, the list of values is padded with empty strings until the number of values is the same as the number of names.
If there are more values than names, the list of values is truncated so that there are as many values as names.
Returns a list of the values of the fields, or None if the user cancels the operation.
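The padding and truncation rules above can be sketched in a few lines. This is only an illustration of the stated behavior, not EasyGui's actual code:

```python
def normalize_values(field_names, field_values):
    # truncate if there are more values than names...
    values = list(field_values)[:len(field_names)]
    # ...and pad with empty strings if there are fewer
    values += [""] * (len(field_names) - len(values))
    return values

print(normalize_values(["Name", "City", "State"], ["Ada"]))
# ['Ada', '', '']
print(normalize_values(["Name"], ["Ada", "London"]))
# ['Ada']
```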
Here is some example code that shows how values returned from multenterbox can be checked for validity before they are accepted:
from __future__ import print_function
import sys
msg = "Enter your personal information"
title = "Credit Card Application"
fieldNames = ["Name", "Street Address", "City", "State", "ZipCode"]
fieldValues = multenterbox(msg, title, fieldNames)
if fieldValues is None:
    sys.exit(0)

# make sure that none of the fields were left blank
while 1:
    errmsg = ""
    for i, name in enumerate(fieldNames):
        if fieldValues[i].strip() == "":
            errmsg += "{} is a required field.\n\n".format(name)
    if errmsg == "":
        break  # no problems found
    fieldValues = multenterbox(errmsg, title, fieldNames, fieldValues)
    if fieldValues is None:
        break
print("Reply was: {}".format(fieldValues))
Note
The first line ‘from __future__’ is only necessary if you are using Python 2.*, and is only needed for this demo.
Letting the user enter password information
passwordbox
A passwordbox is like an enterbox, but used for entering passwords. The text is masked as it is typed in.
multpasswordbox
multpasswordbox has the same interface as multenterbox, but when it is displayed, the last of the fields is assumed to be a password, and is masked with asterisks.
Displaying text
EasyGui provides functions for displaying text.
textbox
The textbox() function displays text in a proportional font. The text will word-wrap.
codebox
The codebox() function displays text in a monospaced font and does not wrap.
Note that you can pass codebox() and textbox() either a string or a list of strings. A list of strings will be converted to text before being displayed. This means that you can use these functions to display the contents of a file this way:
import os
filename = os.path.normcase("c:/autoexec.bat")
f = open(filename, "r")
text = f.readlines()
f.close()
codebox("Contents of file " + filename, "Show File Contents", text)
Working with files
A common need is to ask the user for a filename or for a directory. EasyGui provides a few basic functions for allowing a user to navigate through the file system and choose a directory or a file. (These functions are wrappers around widgets and classes in lib-tk.)
Note that in the current version of EasyGui, the startpos argument is not supported.
diropenbox
diropenbox returns the name of a directory
fileopenbox
fileopenbox returns the name of a file
filesavebox
filesavebox returns the name of a file
Remembering User Settings
EgStore
A common need is to ask the user for some setting and then to "persist it" (store it on disk), so that the next time the user runs your application, you can remember their previous setting.
To simplify the process of storing and restoring user settings, EasyGui provides a class called EgStore. To remember some settings, your application must define a class (let's call it Settings, although you can call it anything you want) that inherits from EgStore.
Your application must also create an object of that class (let’s call the object settings).
The constructor (the __init__ method) of the Settings class can initialize all of the values that you wish to remember.
Once you have done this, you can remember the settings simply by assigning values to instance variables in the settings object, and use the settings.store() method to persist the settings object to disk.
Here is an example of code using the Settings class:
from easygui import EgStore
# -----------------------------------------------------------------------
# define a class named Settings as a subclass of EgStore
# -----------------------------------------------------------------------
class Settings(EgStore):
    def __init__(self, filename):  # filename is required
        # -------------------------------------------------
        # Specify default/initial values for variables that
        # this particular application wants to remember.
        # -------------------------------------------------
        self.userId = ""
        self.targetServer = ""
        # -------------------------------------------------
        # For subclasses of EgStore, these must be
        # the last two statements in __init__
        # -------------------------------------------------
        self.filename = filename  # this is required
        self.restore()
# Create the settings object.
# If the settingsFile exists, this will restore its values
# from the settingsFile.
# create "settings", a persistent Settings object
# Note that the "filename" argument is required.
# The directory for the persistent file must already exist.
settingsFilename = "settings.txt"
settings = Settings(settingsFilename)
# Now use the settings object.
# Initialize the "user" and "server" variables
# In a real application, we'd probably have the user enter them via enterbox
user = "obama_barak"
server = "whitehouse1"

# Save the variables as attributes of the "settings" object
settings.userId = user
settings.targetServer = server
settings.store()  # persist the settings
print("\nInitial settings")
print(settings)

# Run code that gets a new value for userId,
# then persist the settings with the new value
user = "biden_joe"
settings.userId = user
settings.store()
print("\nSettings after modification")
print(settings)

# Delete a settings variable
del settings.userId
print("\nSettings after deletion of userId")
print(settings)
Here is an example of code using a dedicated function to create the Settings class:
from easygui import read_or_create_settings
# Create the settings object.
settings = read_or_create_settings('settings1.txt')
# Save the variables as attributes of the "settings" object
settings.userId = "obama_barak"
settings.targetServer = "whitehouse1"
settings.store()  # persist the settings
print("\nInitial settings")
print(settings)

# Run code that gets a new value for userId,
# then persist the settings with the new value
user = "biden_joe"
settings.userId = user
settings.store()
print("\nSettings after modification")
print(settings)

# Delete a settings variable
del settings.userId
print("\nSettings after deletion of userId")
print(settings)
Trapping Exceptions
exceptionbox
Sometimes exceptions are raised… even in EasyGui applications. Depending on how you run your application, the stack trace might be thrown away, or written to stdout while your application crashes.
EasyGui provides a better way of handling exceptions via exceptionbox. Exceptionbox displays the stack trace in a codebox and may allow you to continue processing.
Exceptionbox is easy to use. Here is a code example:
try:
    someFunction()  # this may raise an exception
except:
    exceptionbox()
Create a package for Android
You can create a package for android using the python-for-android project.
This page explains how to download and use it directly on your own machine (see Packaging with python-for-android) or use the Buildozer tool to automate the entire process.
You can also see Packaging your application for the Kivy Launcher to run kivy programs without compiling them.
For new users, we recommend using Buildozer as the easiest way to make a full APK.
You can also run your Kivy app without a compilation step with the Kivy Launcher app.
Kivy applications can be released on an Android market such as the Play store, with a few extra steps to create a fully signed APK.
The Kivy project includes tools for accessing Android APIs to accomplish vibration, sensor access, texting etc.
These, along with information on debugging on the device, are documented at the
main Android page.
Buildozer
Buildozer is a tool that automates the entire build process.
It downloads and sets up all the prerequisites for python-for-android,
including the android SDK and NDK, then builds an apk that can be automatically pushed to the device.
Buildozer currently works only on Linux and is a beta release, but it already works well and can significantly simplify the APK build.
You can get buildozer at https://github.com/kivy/buildozer:
git clone https://github.com/kivy/buildozer.git
cd buildozer
sudo python setup.py install
This will install buildozer in your system.
Afterwards, navigate to your project directory and run:
buildozer init
This creates a buildozer.spec file controlling your build configuration.
You should edit it appropriately with your app name etc.
You can set variables to control most or all of the parameters passed to python-for-android.
Install buildozer’s dependencies.
Finally, plug in your android device and run:
buildozer android debug deploy run
to build, push and automatically run the apk on your device.
Buildozer has many available options and tools to help you; the steps above are just the simplest way to build and run your APK.
The full documentation is available here.
You can also check the Buildozer README at https://github.com/kivy/buildozer.
Packaging with python-for-android
You can also package directly with python-for-android, which can give you more control but requires you to manually download parts of the Android toolchain.
See the python-for-android documentation
for full details.
Packaging your application for the Kivy Launcher
The Kivy launcher
is an Android application that runs any Kivy examples stored on your SD Card.
To install the Kivy launcher, you must:
Go to the Kivy Launcher page
on the Google Play Store
Click on Install
Select your phone… And you’re done!
If you don’t have access to the Google Play Store on your phone/tablet,
you can download and install the APK manually from http://kivy.org/#download.
Once the Kivy launcher is installed, you can put your Kivy applications in the Kivy directory in your external storage directory
(often available at /sdcard even in devices where this memory is internal), e.g.
/sdcard/kivy/<yourapplication>
<yourapplication> should be a directory containing:
# Your main application file:
main.py
# Some info Kivy requires about your app on android:
android.txt
The file android.txt must contain:
title=<Application Title>
author=<Your Name>
orientation=<portrait|landscape>
These options are just a very basic configuration.
If you create your own APK using the tools above, you can choose many other settings.
Installation of Examples
Kivy comes with many examples, and these can be a great place to start trying the Kivy launcher.
You can run them as below:
1. Download the Kivy demos for Android from https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/kivy/kivydemo-for-android.zip
2. Unzip the contents and go to the folder kivydemo-for-android
3. Copy all the subfolders here to /sdcard/kivy
4. Run the launcher and select one of the Pictures, Showcase, Touchtracer, Cymunk or other demos…
Release on the market
If you have built your own APK with Buildozer or with python-for-android, you can create a release version that may be released on the Play store or other Android markets.
To do this, you must run Buildozer with the release parameter
(e.g. buildozer android release), or if using python-for-android use the --release option to build.py.
This creates a release APK in the bin directory, which you must properly sign and zipalign.
The procedure for doing this is described in the Android documentation at https://developer.android.com/studio/publish/app-signing.html#signing-manually -
all the necessary tools come with the Android SDK.
Targeting Android
Kivy is designed to operate identically across platforms and as a result, makes some clear design decisions.
It includes its own set of widgets and by default,
builds an APK with all the required core dependencies and libraries.
It is possible to target specific Android features, both directly and in a (somewhat) cross-platform way.
See the Using Android APIs section of the Kivy on Android documentation for more details.
qsort = lambda l: l if len(l) <= 1 else qsort([x for x in l[1:] if x < l[0]]) + [l[0]] + qsort([x for x in l[1:] if x >= l[0]])
print(qsort([17, 29, 11, 97, 103, 5]))
# [5, 11, 17, 29, 97, 103]
8. Sum of the first n consecutive integers
n = 10
print(sum(range(0, n+1)))
# 55
9. Swap the values of two variables
a, b = b, a
10. Fibonacci sequence
fib = lambda x: x if x<=1 else fib(x-1) + fib(x-2)
print(fib(20))
# 6765
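The naive recursion above recomputes the same values exponentially many times. Wrapping it with functools.lru_cache memoizes the results and makes it effectively linear, while keeping the same logic:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(x):
    # same recurrence as above, but each fib(n) is computed only once
    return x if x <= 1 else fib(x - 1) + fib(x - 2)

print(fib(20))  # 6765
print(fib(50))  # 12586269025 (the naive version would take far too long)
```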
11. Flatten a nested list into one list
main_list = [[0, 1, 2], [11, 12, 13], [52, 53, 54]]
result = [item for sublist in main_list for item in sublist]
print(result)
# [0, 1, 2, 11, 12, 13, 52, 53, 54]
old_list = [[1, 2, 3], [3, 4, 6], [5, 6, 7]]
result = list(list(x) for x in zip(*old_list))
print(result)
# [[1, 3, 5], [2, 4, 6], [3, 6, 7]]
49. List filtering
result = list(filter(lambda x: x % 2 == 0, [1, 2, 3, 4, 5, 6]))
print(result)
# [2, 4, 6]
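The same filter is often written as a list comprehension, which many consider more idiomatic than filter with a lambda:

```python
# keep only the even numbers
result = [x for x in [1, 2, 3, 4, 5, 6] if x % 2 == 0]
print(result)
# [2, 4, 6]
```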
50. Unpacking
a, *b, c = [1, 2, 3, 4, 5]
print(a) # 1
print(b) # [2, 3, 4]
print(c) # 5
Web Scraping with MechanicalSoup
Web Scraping Databases with MechanicalSoup and SQLite
import mechanicalsoup
import pandas as pd
import sqlite3
# create browser object & open URL
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://en.wikipedia.org/wiki/Comparison_of_Linux_distributions")
# extract all table headers (entire "Distribution" column)
th = browser.page.find_all("th", attrs={"class": "table-rh"})
# tidy up and slice off non-table elements
distribution = [value.text.replace("\n", "") for value in th]
distribution = distribution[:95]
# extract table data (the rest of the table)
td = browser.page.find_all("td")
# tidy up and slice off non-table elements
columns = [value.text.replace("\n", "") for value in td]
columns = columns[6:1051]
column_names = ["Founder",
"Maintainer",
"Initial_Release_Year",
"Current_Stable_Version",
"Security_Updates",
"Release_Date",
"System_Distribution_Commitment",
"Forked_From",
"Target_Audience",
"Cost",
"Status"]
dictionary = {"Distribution": distribution}
# insert column names and their data into a dictionary
for idx, key in enumerate(column_names):
dictionary[key] = columns[idx:][::11]
# convert dictionary to data frame
df = pd.DataFrame(data = dictionary)
# create new database and cursor
connection = sqlite3.connect("linux_distro.db")
cursor = connection.cursor()
# create database table and insert all data frame rows
cursor.execute("create table linux (Distribution, " + ",".join(column_names)+ ")")
for i in range(len(df)):
cursor.execute("insert into linux values (?,?,?,?,?,?,?,?,?,?,?,?)", df.iloc[i])
# PERMANENTLY save inserted data in "linux_distro.db"
connection.commit()
connection.close()
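The slicing expression columns[idx:][::11] used above deserves a note: the flattened cell list is row-major with 11 cells per row, so dropping the first idx elements and then taking every 11th one extracts a single column. A small self-contained illustration (the data here is made up):

```python
# a flattened 2-row x 3-column table, stored row-major
flat = ["r0c0", "r0c1", "r0c2",
        "r1c0", "r1c1", "r1c2"]
n_cols = 3

# column 1: skip to the first cell of that column, then stride by the row width
column_1 = flat[1:][::n_cols]
print(column_1)
# ['r0c1', 'r1c1']
```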
GUI power tool
Turn a Python command-line program into a GUI application.
A GUI is a human-computer interface; in other words, it is a way for humans to interact with a computer.
A GUI is built mainly from windows, icons, and menus, and can be operated with the mouse and keyboard.
A GUI library provides widgets: a collection of graphical control elements.
When building a GUI program, these elements are usually composed by stacking them in layers.
To write a GUI application in Python, you have to use a GUI library.
For Python GUI libraries, there are many choices.
The most widely used is Tkinter. This GUI library is quite flexible and can produce fairly complex interfaces.
However, its page layout and widget handling are complicated, and drawing a good-looking interface takes a lot of effort.
Today we introduce a GUI library called Gooey, which can generate a GUI application from a single line of code.
Above we covered some browser initialization settings and basic operations; now let's look at a few commonly used APIs.
Common operations such as click and fill are methods of the Page object, so all of them can be found in the Page API documentation:
https://playwright.dev/python/docs/api/class-page.
Below we introduce a few common API usages.
Event listening
The Page object provides an on method, which can be used to listen for events that occur on the page, such as close, console, load, request, response, and so on.
For example, we can listen for the response event, which fires every time a network request receives a response. We can set a callback to get the full information of the corresponding Response, as in this example:
from playwright.sync_api import sync_playwright
def on_response(response):
    print(f'Status {response.status}: {response.url}')

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.on('response', on_response)
    page.goto('https://spa6.scrape.center/')
    page.wait_for_load_state('networkidle')
    browser.close()
Here, after creating the Page object, we start listening for the response event, setting the callback to on_response. The on_response callback receives one argument and prints the status code and URL of each Response.
After running it, the console output looks like this:
Status 200: https://spa6.scrape.center/
Status 200: https://spa6.scrape.center/css/app.ea9d802a.css
Status 200: https://spa6.scrape.center/js/app.5ef0d454.js
Status 200: https://spa6.scrape.center/js/chunk-vendors.77daf991.js
Status 200: https://spa6.scrape.center/css/chunk-19c920f8.2a6496e0.css
...
Status 200: https://spa6.scrape.center/css/chunk-19c920f8.2a6496e0.css
Status 200: https://spa6.scrape.center/js/chunk-19c920f8.c3a1129d.js
Status 200: https://spa6.scrape.center/img/logo.a508a8f0.png
Status 200: https://spa6.scrape.center/fonts/element-icons.535877f5.woff
Status 301: https://spa6.scrape.center/api/movie?limit=10&offset=0&token=NGMwMzFhNGEzMTFiMzJkOGE0ZTQ1YjUzMTc2OWNiYTI1Yzk0ZDM3MSwxNjIyOTE4NTE5
Status 200: https://spa6.scrape.center/api/movie/?limit=10&offset=0&token=NGMwMzFhNGEzMTFiMzJkOGE0ZTQ1YjUzMTc2OWNiYTI1Yzk0ZDM3MSwxNjIyOTE4NTE5
Status 200: https://p0.meituan.net/movie/da64660f82b98cdc1b8a3804e69609e041108.jpg@464w_644h_1e_1c
Status 200: https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg@464w_644h_1e_1c
....
Status 200: https://p1.meituan.net/movie/b607fba7513e7f15eab170aac1e1400d878112.jpg@464w_644h_1e_1c
Note: some repeated lines are omitted here.
As you can see, this output corresponds exactly to all the requests and responses in the browser's Network panel, matching the figure one-to-one:
Here we called the route method. The first argument is a regular expression matching URL paths, in this case any link containing .png or .jpg. Requests that match are handled by the cancel_request callback, which can receive two arguments: route, a CallableRoute object, and request, a Request object.
Here we simply call route's abort method to cancel the request, so the end result is that loading of all the images is cancelled.
Observe the result of the run, as shown in the figure:
Here we used route's fulfill method to serve a local file, the HTML file we just defined, and the result is as follows:
Playwright is a browser automation library very similar to Puppeteer.
Both allow you to control a web browser with only a few lines of code.
The possibilities are endless.
From automating mundane tasks and testing web applications to data mining.
With Playwright you can run Firefox and Safari (WebKit), not only Chromium based browsers.
It will also save you time, because Playwright automates away repetitive code, such as waiting for buttons to appear in the page.
You don’t need to be familiar with Playwright, Puppeteer or web scraping to enjoy this tutorial, but knowledge of HTML, CSS and JavaScript is expected.
In this tutorial you’ll learn how to:
Start a browser with PlaywrightClick buttons and wait for actionsExtract data from a website
The Project
To showcase the basics of Playwright, we will create a simple scraper that extracts data about GitHub Topics.
You’ll be able to select a topic and the scraper will return information about repositories tagged with this topic.
The page for JavaScript GitHub Topic
We will use Playwright to start a browser, open the GitHub topic page, click the Load more button to display more repositories, and then extract the following information:
Owner
Name
URL
Number of stars
Description
List of repository topics
Installation
To use Playwright you’ll need Node.js version higher than 10 and a package manager.
We’ll use npm, which comes preinstalled with Node.js.
You can confirm their existence on your machine by running:
node -v && npm -v
If you’re missing either Node.js or NPM, visit the installation tutorial to get started.
Now that we know our environment checks out, let’s create a new project and install Playwright.
mkdir playwright-scraper && cd playwright-scraper
npm init -y
npm i playwright
The first time you install Playwright, it will download browser binaries, so the installation may take a bit longer.
Building a scraper
Creating a scraper with Playwright is surprisingly easy, even if you have no previous scraping experience.
If you understand JavaScript and CSS, it will be a piece of cake.
In your project folder, create a file called scraper.js (or choose any other name) and open it in your favorite code editor.
First, we will confirm that Playwright is correctly installed and working by running a simple script.
Now run it using your code editor or by executing the following command in your project folder.
node scraper.js
If you saw a Chromium window open and the GitHub Topics page successfully loaded, congratulations, you just robotized your web browser with Playwright!
JavaScript GitHub topic
Loading more repositories
When you first open the topic page, the number of displayed repositories is limited to 30.
You can load more by clicking the Load more… button at the bottom of the page.
There are two things we need to tell Playwright to load more repositories:
Click the Load more… button.
Wait for the repositories to load.
Clicking buttons is extremely easy with Playwright.
By prefixing text= to a string you’re looking for, Playwright will find the element that includes this string and click it.
It will also wait for the element to appear if it’s not rendered on the page yet.
Clicking a button
This is a huge improvement over Puppeteer and it makes Playwright lovely to work with.
After clicking, we need to wait for the repositories to load.
If we didn’t, the scraper could finish before the new repositories show up on the page and we would miss that data.
page.waitForFunction() allows you to execute a function inside the browser and wait until the function returns true .
To find that article.border selector, we used browser Dev Tools, which you can open in most browsers by right-clicking anywhere on the page and selecting Inspect.
It means: Select the <article> tag with the border class.
Chrome Dev Tools
Let’s plug this into our code and do a test run.
If you watch the run, you’ll see that the browser first scrolls down and clicks the Load more… button, which changes the text into Loading more.
After a second or two, you’ll see the next batch of 30 repositories appear.
Great job!
Extracting data
Now that we know how to load more repositories, we will extract the data we want.
To do this, we’ll use the page.$$eval function.
It tells the browser to find certain elements and then execute a JavaScript function with those elements.
Extracting data from page
It works like this: page.$$eval finds our repositories and executes the provided function in the browser.
We get repoCards which is an Array of all the repo elements.
The return value of the function becomes the return value of the page.$$eval call.
Thanks to Playwright, you can pull data out of the browser and save them to a variable in Node.js.
Magic!
If you’re struggling to understand the extraction code itself, be sure to check out this guide on working with CSS selectors and this tutorial on using those selectors to find HTML elements.
And here’s the code with extraction included.
When you run it, you’ll see 60 repositories with their information printed to the console.
Conclusion
In this tutorial we learned how to start a browser with Playwright, and control its actions with some of Playwright’s most useful functions: page.click() to emulate mouse clicks, page.waitForFunction() to wait for things to happen and page.$$eval() to extract data from a browser page.
But we’ve only scratched the surface of what’s possible with Playwright.
You can log into websites, fill forms, intercept network communication, and most importantly, use almost any browser in existence.
Where will you take this project next? How about turning it into a command-line interface (CLI) tool that takes a topic and number of repositories on input and outputs a file with the repositories? You can do it now.
Python - Command Line Arguments
Python provides a getopt module that helps you parse command-line options and arguments.
$ python test.py arg1 arg2 arg3
The Python sys module provides access to any command-line arguments via sys.argv. This serves two purposes −
sys.argv is the list of command-line arguments.
len(sys.argv) is the number of command-line arguments.
Here sys.argv[0] is the program, i.e. the script name.
Example
Consider the following script test.py −
#!/usr/bin/python
import sys
print('Number of arguments:', len(sys.argv), 'arguments.')
print('Argument List:', str(sys.argv))
Now run the above script as follows −
$ python test.py arg1 arg2 arg3
This produces the following result −
Number of arguments: 4 arguments.
Argument List: ['test.py', 'arg1', 'arg2', 'arg3']
NOTE − As mentioned above, the first argument is always the script name, and it is also counted in the number of arguments.
Parsing Command-Line Arguments
Python provides a getopt module that helps you parse command-line options and arguments. This module provides two functions and an exception to enable command-line argument parsing.
getopt.getopt method
This method parses command-line options and the parameter list. Following is a simple syntax for this method −
getopt.getopt(args, options, [long_options])
Here is the detail of the parameters −
args − This is the argument list to be parsed.
options − This is the string of option letters that the script wants to recognize; options that require an argument must be followed by a colon (:).
long_options − This is an optional parameter; if specified, it must be a list of strings naming the long options to be supported. Long options that require an argument must be followed by an equal sign ('='). To accept only long options, options should be an empty string.
This method returns a value consisting of two elements: the first is a list of (option, value) pairs; the second is the list of program arguments left after the option list was stripped.
Each option-and-value pair returned has the option as its first element, prefixed with a hyphen for short options (e.g., '-x') or two hyphens for long options (e.g., '--long-option').
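The return shape can be checked directly by passing a hand-built argument list (the option names and file names here are made up for illustration):

```python
import getopt

# "i:" means -i takes an argument; "ofile=" means --ofile takes an argument
opts, remainder = getopt.getopt(
    ["-i", "in.txt", "--ofile", "out.txt", "extra"], "i:", ["ofile="])
print(opts)       # [('-i', 'in.txt'), ('--ofile', 'out.txt')]
print(remainder)  # ['extra']
```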
Exception getopt.GetoptError
This is raised when an unrecognized option is found in the argument list or when an option requiring an argument is given none.
The argument to the exception is a string indicating the cause of the error. The attributes msg and opt give the error message and related option.
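A quick way to see the msg and opt attributes in action (the unrecognized option here is deliberate):

```python
import getopt

error = None
try:
    getopt.getopt(["-x"], "i:")  # -x is not in the option string
except getopt.GetoptError as e:
    error = e
    print(error.opt)  # the offending option letter, without the hyphen
    print(error.msg)  # a human-readable description of the error
```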
Example
Suppose we want to pass two file names through the command line, and we also want to give an option to display the usage of the script. The usage of the script is as follows −
usage: test.py -i <inputfile> -o <outputfile>
Here is the script test.py −
#!/usr/bin/python

import sys, getopt

def main(argv):
    inputfile = ''
    outputfile = ''
    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        print('test.py -i <inputfile> -o <outputfile>')
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print('test.py -i <inputfile> -o <outputfile>')
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg
    print('Input file is "', inputfile)
    print('Output file is "', outputfile)

if __name__ == "__main__":
    main(sys.argv[1:])
Now, run the above script as follows −
$ test.py -h
usage: test.py -i <inputfile> -o <outputfile>
$ test.py -i BMP -o
usage: test.py -i <inputfile> -o <outputfile>
$ test.py -i inputfile
Input file is " inputfile
Output file is "
process command line arguments
import sys
print("\n".join(sys.argv))
sys.argv is a list that contains all the arguments passed to the script on the command line.
sys.argv[0] is the script name.
import sys
print(sys.argv[1:])
from argparse import ArgumentParser
parser = ArgumentParser()
parser.add_argument("-f", "--file", dest="filename",
help="write report to FILE", metavar="FILE")
parser.add_argument("-q", "--quiet",
action="store_false", dest="verbose", default=True,
help="don't print status messages to stdout")
args = parser.parse_args()
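parse_args can also be handed an explicit list, which is handy for exercising the parser without touching sys.argv. A self-contained sketch using the same parser as above (the file name below is made up):

```python
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument("-f", "--file", dest="filename",
                    help="write report to FILE", metavar="FILE")
parser.add_argument("-q", "--quiet",
                    action="store_false", dest="verbose", default=True,
                    help="don't print status messages to stdout")

# parse an explicit argument list instead of sys.argv
args = parser.parse_args(["-f", "report.txt", "--quiet"])
print(args.filename)  # report.txt
print(args.verbose)   # False
```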
Streamlit
Use Streamlit and Python to Build a Data Science App
https://hackernoon.com/how-to-use-streamlit-and-python-to-build-a-data-science-app
https://github.com/streamlit/streamlit
from sklearn.feature_selection import RFE,RFECV, f_regression
from sklearn.linear_model import (LinearRegression, Ridge, Lasso,LarsCV)
from stability_selection import StabilitySelection, RandomizedLasso
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVR
Ranking linear regression coefficients by magnitude
A regression coefficient is the parameter in a regression equation that expresses how strongly the independent variable x influences the dependent variable y.
The larger the regression coefficient, the greater the influence of x on y.
This automation script scrapes the article content from Medium and then reads it aloud, loud and clear.
If you change the script a little bit then it can be used to read articles from other websites too.
I use this script when I am not in the mood to read but to listen.
Libraries:
Beautiful Soup is a Python package for parsing HTML and XML documents.
requests lets you establish a connection between client and server with just one line of code.
pyttsx3 converts text into speech, with control over rate, frequency, and voice.
import pyttsx3
import requests
from bs4 import BeautifulSoup
engine = pyttsx3.init('sapi5')
voices = engine.getProperty('voices')
newVoiceRate = 130 ## Reduce The Speech Rate
engine.setProperty('rate',newVoiceRate)
engine.setProperty('voice', voices[1].id)
def speak(audio):
    engine.say(audio)
    engine.runAndWait()
url = input("Paste article URL\n")
res = requests.get(url)
soup = BeautifulSoup(res.text,'html.parser')
articles = []
for tag in soup.select('.p'):
    articles.append(tag.getText().strip())
text = " ".join(articles)
speak(text)
# engine.save_to_file(text, 'test.mp3')  ## if you want to save the speech as an audio file
engine.runAndWait()
Script Applications:-
AudioBooks
Read Wikipedia Articles
Q&A Bots
One-Click Sketching
I just love this script.
It lets you convert your amazing images into a pencil sketch with a few lines of code.
You can use this script to impress someone by gifting them their pencil sketch.
Libraries:-
OpenCV is a Python library designed to solve computer vision problems.
It has many built-in methods that accomplish major tasks in a few lines of code.
""" Photo Sketching Using Python """
import cv2
img = cv2.imread("elon.jpg")
## Image to Gray Image
gray_image = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
## Gray Image to Inverted Gray Image
inverted_gray_image = 255-gray_image
## Blurring The Inverted Gray Image
blurred_inverted_gray_image = cv2.GaussianBlur(inverted_gray_image, (19,19),0)
## Inverting the blurred image
inverted_blurred_image = 255-blurred_inverted_gray_image
### Preparing Photo sketching
sketch = cv2.divide(gray_image, inverted_blurred_image, scale=256.0)
cv2.imshow("Original Image", img)
cv2.imshow("Pencil Sketch", sketch)
cv2.waitKey(0)
Result — Image By Author
Script Applications:-
Building OCR Software
Detecting Number Plate
Detecting Edges, Creating Funky Images
Stay Up With Top Headlines
Everyone wants to stay up to date with the latest trending news in their country.
This automation script can do the work for you.
It uses an external API to extract all the trending news of your country, state, city, etc.
This script increases productivity and knowledge.
The external API used in the script is News API (newsapi.org).
It offers the latest and trending news, different articles about a particular topic like tesla, business headlines, articles published by a journal, trending news between a timeline, etc.
Libraries:
pyttsx3, a text-to-speech library in Python, and requests.
import pyttsx3
import requests
engine = pyttsx3.init('sapi5')
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[0].id)
def speak(audio):
    engine.say(audio)
    engine.runAndWait()
def trndnews():
    url = "http://newsapi.org/v2/top-headlines?country=us&apiKey=GET_YOUR_OWN"
    page = requests.get(url).json()
    article = page["articles"]
    results = []
    for ar in article:
        results.append(ar["title"])
    for i in range(len(results)):
        print(i + 1, results[i])
    speak(results)
trndnews()
Script Applications:-
ML Fake News Detection.
Stocks Updates On The Start
Buying and selling stocks is one of the trendiest ways of earning money nowadays.
A stock, also known as equity, represents ownership of a fraction of a corporation.
This automation script will give you the price of a stock whenever you open your desktop.
With the same script, you can also generate past years' data for a stock to understand it better.
To run this script on startup, simply add it to the Windows startup folder:
press Win+R, type shell:startup, and paste your script there.
Libraries: yahoo_fin, yfinance
''' Live price of The Stock '''
from yahoo_fin import stock_info
live_price = stock_info.get_live_price("TSLA")
print(round(live_price,2)," USD")
''' Stock Price From 2019 to 2021 '''
import yfinance as yf
stockSymbol = 'TSLA'
stockData = yf.Ticker(stockSymbol)
stockDf_past_2 = stockData.history(start='2019-01-01', end='2021-12-31')  ## period is ignored when start/end are given
print(stockDf_past_2)
Script Applications:-
This Script Can Be Used For Creating Algo Trading Bots, Stock Analysis, Researches, etc.
Bulk Email Sender
In my previous article about automation scripts, I talked about how you can automate sending emails with attachments.
This automation script is a level up from that one.
It allows you to send multiple emails at a time, with the same or different data and messages.
Libraries:-
email, a Python library used to build and manage email messages.
smtplib, which defines a session object over which we can send emails and files.
pandas, for reading the CSV or Excel file.
import smtplib
from email.message import EmailMessage
import pandas as pd

def send_email(remail, rsubject, rcontent):
    email = EmailMessage()                      ## Creating an EmailMessage object
    email['from'] = 'The Pythoneer Here'        ## Sender name
    email['to'] = remail                        ## Recipient
    email['subject'] = rsubject                 ## Subject of the email
    email.set_content(rcontent)                 ## Body of the email
    with smtplib.SMTP(host='smtp.gmail.com', port=587) as smtp:
        smtp.ehlo()                             ## Identify ourselves to the server
        smtp.starttls()                         ## Encrypt the connection
        smtp.login(SENDER_EMAIL, SENDER_PSWRD)  ## Your Gmail id and password (placeholders)
        smtp.send_message(email)                ## Send the email
        print("email sent to", remail)          ## Success message

if __name__ == '__main__':
    df = pd.read_excel('list.xlsx')
    for index, item in df.iterrows():
        send_email(item.iloc[0], item.iloc[1], item.iloc[2])
Script Applications:-
Can Be Used For Sending Newsletters.
Stay Connected With All Your Clients.
No Time For EDA
EDA (exploratory data analysis) refers to the initial investigation done to understand the data more clearly.
It is one of the most important stages of the data science project lifecycle.
It is also referred to as the decision-making stage, because the output of this analysis drives the choice of model, algorithm, parameters, and weights.
Anyone who knows a little about data science will agree that EDA is a time-consuming process.
Well, not anymore.
This automation script uses the amazing D-Tale library to generate a quick summary report of the data given to it with just one line of code.
There are also many similar libraries that can generate a quick summary like D-Tale, for example AutoViz, Sweetviz, etc.
import seaborn as sns
### Printing Inbuilt Datasets of Seaborn Library
print(sns.get_dataset_names())
### Loading Titanic Dataset
df=sns.load_dataset('titanic')
### Importing The Library
import dtale
#### Generating Quick Summary
dtale.show(df)
Script Applications:-
Gives a Quick Review About The Dataset.
Best for beginners.
Smart Login To Different Sites
To protect yourself from hackers, you should always log out of your social media accounts like Facebook, Twitter, Instagram, etc. once you are done with your session.
Entering a user id and password each time is not very joyful work.
This automation script will log in to different sites for you, and once you are done the session is closed automatically.
Libraries:-
Selenium is an open-source web automation tool used for testing and automation.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

PATH = 'chromedriver.exe'  ## Same directory as the Python program
driver = webdriver.Chrome(executable_path=PATH)

##### Login Functions
def login_fb(fid, fpsd):
    driver.get("https://www.facebook.com/")
    email = driver.find_element_by_id("email")
    email.send_keys(fid)
    password = driver.find_element_by_id("pass")
    password.send_keys(fpsd)
    driver.find_element_by_id("u_0_d_Dw").click()  ## This button id may change; inspect the page to confirm

### Like Facebook, write login functions for the other platforms too.
def login_insta():
    pass
def login_medium():
    pass
def login_twitter():
    pass
def login_linkedin():
    pass

login_fb("YOUR_LOGIN_ID", "YOUR_PASSWORD")
login_insta()
login_medium()
login_twitter()
login_linkedin()
This Automation Script Saves Time and Increases Productivity.
Be Safe & Watermark Your Images
The internet is filled with digital thieves, who are always looking for other people's work to pass off as their own without giving proper attribution.
Images are among the most commonly stolen property on the internet.
You click a masterpiece and upload it to showcase it to the world, and some thief steals it and publishes it under their own name.
To prevent this, you should always watermark your images with your unique sign.
This automation script will do the work for you.
Libraries:- OpenCV
Process:- We are basically overlaying one image (the watermark) on top of another image (the original) at its center coordinates.
With little changes and a loop, you can watermark hundreds of images in minutes.
import cv2
watermark = cv2.imread("watermark.png")
img = cv2.imread("no-problem.jpg")
h_img, w_img, _ = img.shape
center_x = int(w_img/2)
center_y = int(h_img/2)
h_watermark, w_watermark, _ = watermark.shape
top_y = center_y - int(h_watermark/2)
left_x = center_x - int(w_watermark/2)
bottom_y = top_y + h_watermark
right_x = left_x + w_watermark
position = img[top_y:bottom_y, left_x:right_x]
result = cv2.addWeighted(position, 1, watermark, 0.5, 0)
img[top_y:bottom_y, left_x:right_x] = result
cv2.imwrite("watermarked_image.jpg", img)
cv2.imshow("Image With Watermark", img)
cv2.waitKey(0)
cv2.destroyAllWindows()
Script Applications:-
Overlaying Two Images.
Image Filtering & Masking.
Remember That
Sometimes when working on a project you get interrupted by some other task that also needs to be done the same day, and most of the time you forget it.
Not anymore: this script will remember everything for you and remind you about it after a certain time as a desktop notification.
Libraries:- win10toast, a Python library that sends desktop notifications.
from win10toast import ToastNotifier
import time
toaster = ToastNotifier()
header = input("What You Want Me To Remember\n")
text = input("Related Message\n")
time_min=float(input("In how many minutes?\n"))
time_min = time_min * 60
print("Setting up reminder..")
time.sleep(2)
print("all set!")
time.sleep(time_min)
toaster.show_toast(f"{header}", f"{text}", duration=10, threaded=True)
while toaster.notification_active(): time.sleep(0.005)
Google Scraper
Google is one of the biggest and most used search engines.
There are over 3.8 million searches done per minute around the globe.
Most of them are simple queries that get answered on the first results page.
This script will scrape Google's search results and generate an answer without you even opening Google.
Libraries:- requests, BeautifulSoup, and Tkinter a GUI library in python.
Process:- At first with the help of Tkinter, a GUI is created that is used to take the query of the user.
Once the user entered the query it is sent to the google scraper function that scrapes the results based on the query and generates the answers.
Then with the help of the showinfo class in Tkinter, the results are shown as a pop-up notification.
from tkinter import *
from tkinter.messagebox import showinfo
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

def action():
    ### Code for receiving the query
    query = textF.get()
    textF.delete(0, END)
    print(query)

    def google(query):
        query = query.replace(" ", "+")
        try:
            url = f'https://www.google.com/search?q={query}&oq={query}&aqs=chrome..69i57j46j69i59j35i39j0j46j0l2.4948j0j7&sourceid=chrome&ie=UTF-8'
            res = requests.get(url, headers=headers)
            soup = BeautifulSoup(res.text, 'html.parser')
        except:
            print("Make sure you have an internet connection")
        try:
            try:
                ans = soup.select('.RqBzHd')[0].getText().strip()
            except:
                try:
                    title = soup.select('.AZCkJd')[0].getText().strip()
                    try:
                        ans = soup.select('.e24Kjd')[0].getText().strip()
                    except:
                        ans = ""
                    ans = f'{title}\n{ans}'
                except:
                    try:
                        ans = soup.select('.hgKElc')[0].getText().strip()
                    except:
                        ans = soup.select('.kno-rdesc span')[0].getText().strip()
        except:
            ans = "can't find on google"
        return ans

    result = google(str(query))
    showinfo(title="Result For Your Query", message=result)

main = Tk()
main.geometry("300x100")
main.title("Karl")
top = Frame(main)
top.pack(side=TOP)
textF = Entry(main, font=("helvetica", 14, "bold"))
textF.focus()
textF.pack(fill=X, pady=5)
textF.insert(0, "Enter your query")
textF.configure(state=DISABLED)

def on_click(event):
    textF.configure(state=NORMAL)
    textF.delete(0, END)
    textF.unbind('<Button-1>', on_click_id)

on_click_id = textF.bind('<Button-1>', on_click)
btn = Button(main, text="Search", font=("Verdana", 16), command=action)
btn.pack()
main.mainloop()
Converting PDF To Audio Files
This automation task is one of my favorites.
I use it almost every day.
Here our task is to write a python script that can convert pdfs into audio files.
Libraries:-
PyPDF2, a library in Python used to read text from a PDF file.
pyttsx3, a text-to-speech conversion library.
Process:- We first use the PyPDF2 library to read the text from the PDF file, then convert the text to speech and save it as an audio file.
import pyttsx3, PyPDF2

pdfreader = PyPDF2.PdfFileReader(open('story.pdf', 'rb'))
speaker = pyttsx3.init()

full_text = ''
for page_num in range(pdfreader.numPages):
    text = pdfreader.getPage(page_num).extractText()  ## Extracting text from the PDF
    cleaned_text = text.strip().replace('\n', ' ')    ## Removes unnecessary spaces and line breaks
    print(cleaned_text)                               ## Print the text from the PDF
    # speaker.say(cleaned_text)                       ## Let the speaker read the text aloud
    full_text += cleaned_text + ' '                   ## Collect every page, not just the last one

speaker.save_to_file(full_text, 'story.mp3')          ## Saving the text in an audio file 'story.mp3'
speaker.runAndWait()
speaker.stop()
Script Applications:-
Audiobooks.
Storyteller.
By Adding Little Bit of Web Scraping, The Same Script Can Be Used To Read Articles From Sites Like Medium and WordPress.
Playing Random Music From The List
I have a good collection of songs that I love to listen to while working on my projects.
For a music lover like me, this script is very useful.
It randomly picks a song from a folder of songs.
Libraries:-
os, a module in Python that handles operating-system tasks like opening, deleting, renaming, and closing files.
random, a module that provides randomness.
Process:- First, with the help of the os module, we list all the music files inside the folder; then we generate a random index in the range of the list's length.
After generating the random index, we use it to play that music file with the os.startfile() function.
import os
import random

music_dir = 'G:\\new english songs'
songs = os.listdir(music_dir)
song = random.randint(0, len(songs) - 1)  ## randint is inclusive on both ends
print(songs[song])                        ## Prints the song name
os.startfile(os.path.join(music_dir, songs[song]))
Script Features:-
Playing Music, Videos.
Can Be Used To Run Random Files Inside a Folder.
No BookMarks Anymore
Every day before going to bed I search the internet for some good content to read the next day.
Most of the time I bookmark the websites or articles I come across, but day by day my bookmarks have grown so much that I now have over 100 of them across my browsers.
So I figured out a different way to tackle this problem with the help of Python.
Now I copy-paste the links to those websites into a text file, and every morning I run my script, which opens all those websites in my browser again.
Libraries:-
webbrowser, a library in Python that opens URLs in the default browser automatically.
Process:- The process is pretty simple: the script reads the URLs from the file and opens each URL in the browser with the help of the webbrowser library.
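The process above can be sketched with the standard-library webbrowser module. This is a minimal sketch, not the author's exact script: the file name bookmarks.txt is an assumption, and the script expects one URL per line.

```python
import os
import webbrowser

def open_bookmarks(path):
    """Read one URL per line from `path` and open each in the default browser."""
    with open(path) as f:
        urls = [line.strip() for line in f if line.strip()]
    for url in urls:
        webbrowser.open_new_tab(url)
    return urls

# 'bookmarks.txt' is a hypothetical file name -- point this at your own list
if os.path.exists('bookmarks.txt'):
    opened = open_bookmarks('bookmarks.txt')
    print(f"Opened {len(opened)} sites")
```

webbrowser.open_new_tab() falls back to opening a new window when the browser cannot open a tab, so the script works across platforms without extra configuration.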
Getting Wikipedia Information
Wikipedia is a great source of knowledge and information.
This script lets you fetch every information from Wikipedia directly from your command line.
Libraries:-
wikipedia, a Python library that makes parsing data from Wikipedia super easy.
Working:- The script takes a query, parses the results from Wikipedia, and then speaks the results out loud.
import wikipedia
import pyttsx3
engine = pyttsx3.init('sapi5')
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[0].id)
def speak(audio):
    engine.say(audio)
    engine.runAndWait()
query = input("What You Want To Ask ??")
results = wikipedia.summary(query, sentences=2)
speak("According to Wikipedia\n")
print(results)
speak(results)
Smart Weather Information
No one wants to get stuck in the rain or heavy snowfall.
Everyone wants to be updated with the weather forecast.
This automation script will send weather information as a desktop notification whenever you open your PC.
Libraries:-
requests, a library that makes sending HTTP requests simple and human-friendly; with a single line of code it can establish a connection between the client and the target server.
Beautiful Soup, a Python package for parsing HTML and XML documents.
win10toast's ToastNotifier, a Python class that sends desktop notifications.
import time
import requests
from bs4 import BeautifulSoup
from win10toast import ToastNotifier

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
def weather(city):
    city = city.replace(" ", "+")
    res = requests.get(f'https://www.google.com/search?q={city}&oq={city}&aqs=chrome.0.35i39l2j0l4j46j69i60.6128j1j7&sourceid=chrome&ie=UTF-8', headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    location = soup.select('#wob_loc')[0].getText().strip()
    current_time = soup.select('#wob_dts')[0].getText().strip()
    info = soup.select('#wob_dc')[0].getText().strip()
    weather = soup.select('#wob_tm')[0].getText().strip()
    information = f"{location} \n {current_time} \n {info} \n {weather} °C "
    toaster = ToastNotifier()
    toaster.show_toast("Weather Information",
                       f"{information}",
                       duration=10,
                       threaded=True)
    while toaster.notification_active():
        time.sleep(0.005)
# print("enter the city name")
# city=input()
city = "London"
city=city+" weather"
weather(city)
Sending Emails With Attachment
As a freelancer, every day I need to send multiple emails that look almost the same, with only small differences.
This script helps us send multiple emails at the same time with different names and content.
Libraries:-
email, a Python library used to build and manage email messages.
smtplib, which defines a session object over which we can send emails and files.
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
from email import encoders

body = '''
Hello, Admin
I am attaching the sales files with this email.
This year we got a whopping 200% profit on our sales.
Regards,
Team Sales
xyz.com
'''

# Sender email address and password (use your own credentials)
senders_email = 'YOUR_EMAIL@gmail.com'
sender_password = 'YOUR_PASSWORD'
receiver_email = 'parasharabhay13@gmail.com'

# MIME setup
message = MIMEMultipart()
message['From'] = senders_email
message['To'] = receiver_email
message['Subject'] = 'Sales Report 2021 -- Team Sales'
message.attach(MIMEText(body, 'plain'))

## Attachment
attach_file_name = 'car-sales.csv'
attach_file = open(attach_file_name, 'rb')
payload = MIMEBase('application', 'octet-stream')
payload.set_payload(attach_file.read())
encoders.encode_base64(payload)
payload.add_header('Content-Disposition', 'attachment', filename=attach_file_name)
message.attach(payload)

# SMTP connection for sending the email
session = smtplib.SMTP('smtp.gmail.com', 587)  # Gmail host and port
session.starttls()                             # Enable encryption
session.login(senders_email, sender_password)  # Log in with your mail id and password
text = message.as_string()
session.sendmail(senders_email, receiver_email, text)
session.quit()
print('Mail Sent')
Shortening URLs
Sometimes those big URLs become very annoying to read and share.
This script uses an external API to shorten the URL.
from __future__ import with_statement
import contextlib
try:
    from urllib.parse import urlencode
except ImportError:
    from urllib import urlencode
try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen
import sys

def make_tiny(url):
    request_url = ('http://tinyurl.com/api-create.php?' +
                   urlencode({'url': url}))
    with contextlib.closing(urlopen(request_url)) as response:
        return response.read().decode('utf-8')

def main():
    for tinyurl in map(make_tiny, sys.argv[1:]):
        print(tinyurl)

if __name__ == '__main__':
    main()
'''
-----------------------------OUTPUT------------------------
python url_shortener.py https://www.wikipedia.org/
https://tinyurl.com/buf3qt3
'''
Downloading YouTube Videos
I use YouTube for 2-3 hours every day, sometimes even more.
Most of my learning comes from YouTube, because it is free and contains a vast amount of information.
Certain videos stand out from the others, and I want to keep those with me to watch later, even when I don't have an internet connection.
This script does the job for me by downloading the YouTube video.
It uses the pytube library to do the job.
Libraries:-
pytube, a lightweight Python library for downloading YouTube videos.
Tkinter, one of the most famous and useful GUI development libraries, which makes it super easy to create awesome GUIs with less effort.
Why Tkinter:-
The whole concept of the script is to create an interface through which you can download YouTube videos by just pasting a link.
That interface can't be our CLI, so we are going to create a simple GUI for our script.
You can make it even better by running your Python code without a console, with just one click.
Core Download Code
import pytube

try:
    video_url = 'https://www.youtube.com/watch?v=lTTajzrSkCw'
    youtube = pytube.YouTube(video_url)
    video = youtube.streams.first()
    video.download('C:/Users/abhay/Desktop/')
    print("Download Successful !!")
except:
    print("Something Went Wrong !!")
Cleaning Download Folder
One of the messiest things in this world is the download folder of a developer.
When writing a blog or working on a project, we just download images and save them with ugly and funny names like asdfg.jpg.
This Python script will clean your download folder by renaming and deleting certain files based on some conditions.
Libraries:- OS
import os

folder_location = 'C:\\Users\\user\\Downloads\\demo'
os.chdir(folder_location)
list_of_files = os.listdir()

## Selecting all images
images = [content for content in list_of_files if content.endswith(('.png', '.jpg', '.jpeg'))]
for index, image in enumerate(images):
    os.rename(image, f'{index}.png')

## Deleting all images
################## Write your script here ######## Try to create your own code
Sending Text Messages
There are many free text message services available on the internet, like Twilio, Fast2SMS, etc.
Fast2SMS provides 50 free messages, with a prebuilt template to connect your script to their API.
This script will let us send a text SMS to any number directly from our command-line interface.
import requests

def send_sms(number, message):
    url = 'https://www.fast2sms.com/dev/bulk'
    params = {
        'authorization': 'FIND_YOUR_OWN',
        'sender_id': 'FSTSMS',
        'message': message,
        'language': 'english',
        'route': 'p',
        'numbers': number
    }
    response = requests.get(url, params=params)
    dic = response.json()
    # print(dic)
    return dic.get('return')

num = input("Enter The Number:\n")  ## Keep the number as a string so leading zeros survive
msg = input("Enter The Message You Want To Send:\n")
s = send_sms(num, msg)
if s:
    print("Successfully sent")
else:
    print("Something went wrong..")
Converting seconds to hours, minutes, and seconds
When working on projects that require you to display a number of seconds in an hours:minutes:seconds format, you can use the following Python script.
def convert(seconds):
    seconds = seconds % (24 * 3600)  # Drop whole days
    hour = seconds // 3600
    seconds %= 3600
    minutes = seconds // 60
    seconds %= 60
    return "%d:%02d:%02d" % (hour, minutes, seconds)

# Driver program
n = 12345
print(convert(n))  # 3:25:45
Raising a number to the power
Another popular Python script calculates the power of a number, for example, 2 to the power of 4. Here, there are at least three methods to choose from: math.pow(), pow(), or the ** operator. Here is the script.
import math

# Assign values to x and n
x = 4
n = 3

# Method 1
power = x ** n
print("%d to the power %d is %d" % (x, n, power))

# Method 2
power = pow(x, n)
print("%d to the power %d is %d" % (x, n, power))

# Method 3
power = math.pow(x, n)  # math.pow() always returns a float
print("%d to the power %d is %5.2f" % (x, n, power))
If/else statement
This is arguably one of the most used statements in Python. It allows your code to execute a function if a certain condition is met. Unlike other languages, you don’t need to use curly braces. Here is a simple if/else script.
# Assign a value
number = 50

# Check whether the number is at least 50
if number >= 50:
    print("You have passed")
else:
    print("You have not passed")
Convert images to JPEG
Many conventional systems do not accept image formats such as PNG. As such, you'll be required to convert them into JPEG files. Luckily, there's a Python script that allows you to automate this process.
import os
import sys
from PIL import Image

if len(sys.argv) > 1:
    if os.path.exists(sys.argv[1]):
        im = Image.open(sys.argv[1])
        target_name = sys.argv[1] + ".jpg"
        rgb_im = im.convert('RGB')
        rgb_im.save(target_name)
        print("Saved as " + target_name)
    else:
        print(sys.argv[1] + " not found")
else:
    print("Usage: convert2jpg.py <file>")
Download Google images
If you are working on a project that demands many images, there's a Python script that enables you to download them. With it, you can download hundreds of images simultaneously. However, you should avoid violating copyright terms.
Read battery level of Bluetooth device
This script allows you to read the battery level of your Bluetooth headset. This is especially useful if the level does not display on your PC. However, it does not support all Bluetooth headsets, and for it to run you need Docker on your system.
Delete Telegram messages
Let's face it, messaging apps do chew up much of your device's storage space, and Telegram is no different. Luckily, this script allows you to delete all supergroup messages. You need to enter the supergroup's information for the script to run.
Get song lyrics
This is yet another popular Python script, one that enables you to scrape lyrics from the Genius site. It primarily works with Spotify; however, other media players that implement the DBus MediaPlayer2 interface can also use the script. With it, you can sing along to your favorite song.
Heroku hosting
Heroku is one of the most preferred hosting services. Used by thousands of developers, it allows you to build apps for free. Likewise, you can host your Python applications and scripts on Heroku.
Github activity
If you contribute to open source projects, keeping a record of your contributions is recommended. Not only do you track your contributions, but you also appear professional when displaying your work to other people. With this script, you can generate a robust activity graph.
Removing duplicate code
When creating large apps or working on projects, it is normal to end up with duplicate items in a list. This not only makes coding strenuous, but also makes your code appear unprofessional. With this script, you can remove duplicates seamlessly.
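The script itself isn't shown in the original, but an order-preserving de-duplication can be sketched in one line with dict.fromkeys(), which keeps only the first occurrence of each item:

```python
def remove_duplicates(items):
    """Return a new list with duplicates removed, keeping first-seen order."""
    # dict keys are unique and (since Python 3.7) preserve insertion order
    return list(dict.fromkeys(items))

print(remove_duplicates([3, 1, 3, 2, 1, 5]))  # [3, 1, 2, 5]
```

This works for any hashable items; for unhashable ones (like nested lists) you would fall back to a seen-list loop.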
Sending emails
Emails are crucial to any business's communication avenues. With Python, you can enable sites and web apps to send them without hiccups. However, businesses do not want to send each email manually; instead, they prefer to automate the process. This script allows you to choose which emails to reply to.
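The script itself isn't shown; as a hedged sketch, the "choose which emails" step can be a simple keyword filter over unread messages fetched with the standard-library imaplib and email modules. The host, credentials, and keyword below are all placeholders, not part of the original article.

```python
import email
import imaplib

def wants_reply(raw_bytes, keyword):
    """Decide whether a raw RFC822 message deserves a reply,
    based on a keyword appearing in its Subject header."""
    msg = email.message_from_bytes(raw_bytes)
    return keyword.lower() in (msg.get('Subject') or '').lower()

def fetch_unread(host, user, password):
    """Yield the raw bytes of every unread message in the inbox.
    host/user/password are placeholders -- supply your own."""
    with imaplib.IMAP4_SSL(host) as imap:
        imap.login(user, password)
        imap.select('INBOX')
        _, data = imap.search(None, 'UNSEEN')
        for num in data[0].split():
            _, parts = imap.fetch(num, '(RFC822)')
            yield parts[0][1]
```

A practical script would loop over fetch_unread(), call wants_reply() on each message, and send the response with smtplib as shown in the earlier email sections.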
Find specific files on your system
Often, you forget the names or location of files on your system. This is not only annoying but also consumes time navigating through different folders. While there are programs that help you search for files, you need one that can automate the process.
Luckily, this script enables you to choose which files and file types to search for. For example, if you want to search for MP3 files, you can use this script.
import fnmatch
import os

rootPath = '/'
pattern = '*.mp3'

for root, dirs, files in os.walk(rootPath):
    for filename in fnmatch.filter(files, pattern):
        print(os.path.join(root, filename))
Generating random passwords
Passwords bolster the privacy of app and website users. Besides, they prevent fraudulent use of accounts by cyber criminals. As such, you need to create an app or website that can generate random strong passwords. With this script, you can seamlessly generate them.
import string
from random import choice, randint

# For real credentials, prefer the secrets module over random
characters = string.ascii_letters + string.punctuation + string.digits
password = "".join(choice(characters) for x in range(randint(8, 16)))
print(password)
Print odd numbers
Some projects may require you to print odd numbers within a specific range. While you can do this manually, it is time-consuming and prone to error. This means you need a program that can automate the process. Thanks to this script, you can achieve this.
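A minimal sketch of such a script, using a list comprehension over an inclusive range (the bounds are example values):

```python
def odd_numbers(start, end):
    """Return every odd number in the inclusive range [start, end]."""
    return [n for n in range(start, end + 1) if n % 2 != 0]

print(odd_numbers(1, 10))  # [1, 3, 5, 7, 9]
```

Because the check is `n % 2 != 0`, the same function also handles negative ranges correctly.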
Get date value
Python allows you to format a date value in numerous ways. With the datetime module, this script allows you to read the current date and build a custom value.
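A short sketch with the datetime module: read the current date, then build a custom value, formatting both with strftime() (the format codes are standard; the particular layouts are just examples):

```python
from datetime import datetime

# Read the current date and format it as day-month-year
now = datetime.now()
print("Current date:", now.strftime("%d-%m-%Y"))

# Build a custom date value and format it differently
custom = datetime(2021, 12, 31)
print("Custom date:", custom.strftime("%A, %d %B %Y"))  # Friday, 31 December 2021
```

strftime() accepts any combination of codes like %d, %m, %Y, %A, and %B, so the same value can be rendered in as many layouts as you need.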
Removing items from a list
You’ll often have to modify lists in your projects. Python enables you to do this using the insert() and remove() methods. Here is a script you can use to achieve this.
# Declare a fruit list
fruits = ["Mango","Orange","Guava","Banana"]
# Insert an item in the 2nd position
fruits.insert(1, "Grape")
# Displaying list after inserting
print("The fruit list after insert:")
print(fruits)
# Remove an item
fruits.remove("Guava")
# Print the list after delete
print("The fruit list after delete:")
print(fruits)
Count list items
Using the count() method, you can print how many times a string appears in another string. You need to provide the string that Python will search. Here is a script to help you do so.
# Define the string
string = 'Python Bash Java PHP PHP PERL'
# Define the search string
search = 'P'
# Store the count value
count = string.count(search)
# Print the formatted output
print("%s appears %d times" % (search, count))
https://docs.python.org/3/tutorial/
So far we’ve encountered two ways of writing values: expression statements and the print() function.
(A third way is using the write() method of file objects; the standard output file can be referenced as sys.stdout.
See the Library Reference for more information on this.)
Often you’ll want more control over the formatting of your output than simply printing space-separated values.
There are several ways to format output.
To use formatted string literals, begin a string with f or F before the opening quotation mark or triple quotation mark.
Inside this string, you can write a Python expression between { and }
characters that can refer to variables or literal values.
>>> year = 2016
>>> event = "Referendum"
>>> f"Results of the {year} {event}"
"Results of the 2016 Referendum"
The str.format() method of strings requires more manual effort.
You’ll still use { and } to mark where a variable will be substituted and can provide detailed formatting directives,
but you’ll also need to provide the information to be formatted.
>>> yes_votes = 42_572_654
>>> no_votes = 43_132_495
>>> percentage = yes_votes / (yes_votes + no_votes)
>>> "{:-9} YES votes {:2.2%}".format(yes_votes, percentage)
" 42572654 YES votes 49.67%"
Finally, you can do all the string handling yourself by using string slicing and concatenation operations to create any layout you can imagine.
The string type has some methods that perform useful operations for padding strings to a given column width.
When you don’t need fancy output but just want a quick display of some variables for debugging purposes, you can convert any value to a string with the repr() or str() functions.
The str() function is meant to return representations of values which are fairly human-readable, while repr() is meant to generate representations which can be read by the interpreter (or will force a SyntaxError if there is no equivalent syntax).
For objects which don’t have a particular representation for human consumption, str() will return the same value as
repr().
Many values, such as numbers or structures like lists and dictionaries, have the same representation using either function.
Strings, in particular, have two distinct representations.
Some examples:
>>> s = "Hello, world."
>>> str(s)
"Hello, world."
>>> repr(s)
""Hello, world.""
>>> str(1/7)
"0.14285714285714285"
>>> x = 10 * 3.25
>>> y = 200 * 200
>>> s = "The value of x is " + repr(x) + ", and y is " + repr(y) + "..."
>>> print(s)
The value of x is 32.5, and y is 40000...
>>> # The repr() of a string adds string quotes and backslashes:
>>> hello = 'hello, world\n'
>>> hellos = repr(hello)
>>> print(hellos)
'hello, world\n'
>>> # The argument to repr() may be any Python object:
>>> repr((x, y, ('spam', 'eggs')))
"(32.5, 40000, ('spam', 'eggs'))"
The string module contains a Template class that offers yet another way to substitute values into strings, using placeholders like
$x and replacing them with values from a dictionary, but offers much less control of the formatting.
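A quick sketch of string.Template, reusing the referendum example from above:

```python
from string import Template

# $year and $event are placeholders filled in from a dictionary
t = Template("Results of the $year $event")
print(t.substitute({"year": 2016, "event": "Referendum"}))  # Results of the 2016 Referendum
```

substitute() raises KeyError for a missing placeholder; safe_substitute() leaves it in the output instead.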
7.1.1. Formatted String Literals
Formatted string literals (also called f-strings for short) let you include the value of Python expressions inside a string by prefixing the string with f or F and writing expressions as
{expression}.
An optional format specifier can follow the expression.
This allows greater control over how the value is formatted.
The following example rounds pi to three places after the decimal:
>>> import math
>>> print(f"The value of pi is approximately {math.pi:.3f}.")
The value of pi is approximately 3.142.
Passing an integer after the ':' will cause that field to be a minimum number of characters wide.
This is useful for making columns line up.
>>> table = {"Sjoerd": 4127, "Jack": 4098, "Dcab": 7678}
>>> for name, phone in table.items():
...     print(f'{name:10} ==> {phone:10d}')
...
Sjoerd     ==>       4127
Jack       ==>       4098
Dcab       ==>       7678
Other modifiers can be used to convert the value before it is formatted.
'!a' applies ascii(), '!s' applies str(), and '!r'
applies repr():
>>> animals = "eels"
>>> print(f"My hovercraft is full of {animals}.")
My hovercraft is full of eels.
>>> print(f"My hovercraft is full of {animals!r}.")
My hovercraft is full of "eels".
The = specifier can be used to expand an expression to the text of the expression, an equal sign, then the representation of the evaluated expression:
>>> bugs = "roaches"
>>> count = 13
>>> area = "living room"
>>> print(f"Debugging {bugs=} {count=} {area=}")
Debugging bugs="roaches" count=13 area="living room"
See self-documenting expressions for more information on the = specifier.
For a reference on these format specifications, see the reference guide for the Format Specification Mini-Language.
7.1.2. The String format() Method
Basic usage of the str.format() method looks like this:
>>> print("We are the {} who say "{}!"".format("knights", "Ni"))
We are the knights who say "Ni!"
The brackets and characters within them (called format fields) are replaced with the objects passed into the str.format() method.
A number in the brackets can be used to refer to the position of the object passed into the
str.format() method.
>>> print("{0} and {1}".format("spam", "eggs"))
spam and eggs
>>> print("{1} and {0}".format("spam", "eggs"))
eggs and spam
If keyword arguments are used in the str.format() method, their values are referred to by using the name of the argument.
>>> print("This {food} is {adjective}.".format(
...
food="spam", adjective="absolutely horrible"))
This spam is absolutely horrible.
Positional and keyword arguments can be arbitrarily combined:
>>> print("The story of {0}, {1}, and {other}.".format("Bill", "Manfred",
...       other="Georg"))
The story of Bill, Manfred, and Georg.
If you have a really long format string that you don’t want to split up, it would be nice if you could reference the variables to be formatted by name instead of by position.
This can be done by simply passing the dict and using square brackets '[]' to access the keys.
>>> table = {"Sjoerd": 4127, "Jack": 4098, "Dcab": 8637678}
>>> print("Jack: {0[Jack]:d}; Sjoerd: {0[Sjoerd]:d}; "
...       "Dcab: {0[Dcab]:d}".format(table))
Jack: 4098; Sjoerd: 4127; Dcab: 8637678
This could also be done by passing the table dictionary as keyword arguments with the ** notation.
>>> table = {"Sjoerd": 4127, "Jack": 4098, "Dcab": 8637678}
>>> print("Jack: {Jack:d}; Sjoerd: {Sjoerd:d}; Dcab: {Dcab:d}".format(**table))
Jack: 4098; Sjoerd: 4127; Dcab: 8637678
This is particularly useful in combination with the built-in function vars(), which returns a dictionary containing all local variables.
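As a quick sketch of that combination (the function and the local variable names here are made up for illustration):

```python
# Format a message from local variables using vars() + str.format(**...)
def describe():
    name = "Jack"    # hypothetical local variables
    phone = 4098
    # vars() with no argument returns the local symbol table as a dict
    return "{name} can be reached at {phone}.".format(**vars())

print(describe())  # Jack can be reached at 4098.
```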
As an example, the following lines produce a tidily aligned set of columns giving integers and their squares and cubes:
>>> for x in range(1, 11):
...     print("{0:2d} {1:3d} {2:4d}".format(x, x*x, x*x*x))
...
 1   1    1
 2   4    8
 3   9   27
 4  16   64
 5  25  125
 6  36  216
 7  49  343
 8  64  512
 9  81  729
10 100 1000
For a complete overview of string formatting with str.format(), see Format String Syntax.
7.1.3. Manual String Formatting
Here’s the same table of squares and cubes, formatted manually:
>>> for x in range(1, 11):
...     print(repr(x).rjust(2), repr(x*x).rjust(3), end=" ")
...     # Note use of 'end' on previous line
...     print(repr(x*x*x).rjust(4))
...
 1   1    1
 2   4    8
 3   9   27
 4  16   64
 5  25  125
 6  36  216
 7  49  343
 8  64  512
 9  81  729
10 100 1000
(Note that the one space between each column was added by the way print() works: it always adds spaces between its arguments.)
The str.rjust() method of string objects right-justifies a string in a field of a given width by padding it with spaces on the left.
There are similar methods str.ljust() and str.center().
These methods do not write anything, they just return a new string.
If the input string is too long, they don’t truncate it, but return it unchanged; this will mess up your column lay-out but that’s usually better than the alternative, which would be lying about a value.
(If you really want truncation you can always add a slice operation, as in x.ljust(n)[:n].)
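A small sketch of these methods and the slicing trick:

```python
# ljust/rjust/center pad a string to a field width with spaces
s = "Python"
print(repr(s.ljust(10)))   # 'Python    '
print(repr(s.rjust(10)))   # '    Python'
print(repr(s.center(10)))  # '  Python  '

# A too-long string is returned unchanged; add a slice to force truncation
print("monumental".ljust(4))      # monumental
print("monumental".ljust(4)[:4])  # monu
```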
There is another method, str.zfill(), which pads a numeric string on the left with zeros.
It understands about plus and minus signs:
>>> "12".zfill(5)
'00012'
>>> "-3.14".zfill(7)
'-003.14'
>>> "3.14159265359".zfill(5)
'3.14159265359'
7.1.4. Old string formatting
The % operator (modulo) can also be used for string formatting.
Given 'string' % values, instances of % in string are replaced with zero or more elements of values.
This operation is commonly known as string interpolation.
For example:
>>> import math
>>> print("The value of pi is approximately %5.3f." % math.pi)
The value of pi is approximately 3.142.
More information can be found in the printf-style String Formatting section.
7.2. Reading and Writing Files
open() returns a file object, and is most commonly used with two positional arguments and one keyword argument:
open(filename, mode, encoding=None)
>>> f = open("workfile", "w", encoding="utf-8")
The first argument is a string containing the filename.
The second argument is another string containing a few characters describing the way in which the file will be used.
mode can be 'r' when the file will only be read, 'w' for only writing (an existing file with the same name will be erased), and 'a' opens the file for appending; any data written to the file is automatically added to the end.
'r+' opens the file for both reading and writing.
The mode argument is optional; 'r' will be assumed if it’s omitted.
Normally, files are opened in text mode, that means, you read and write strings from and to the file, which are encoded in a specific encoding.
If encoding is not specified, the default is platform dependent (see open()).
Because UTF-8 is the modern de-facto standard, encoding="utf-8" is recommended unless you know that you need to use a different encoding.
Appending a 'b' to the mode opens the file in binary mode.
Binary mode data is read and written as bytes objects.
You cannot specify encoding when opening a file in binary mode.
In text mode, the default when reading is to convert platform-specific line endings (\n on Unix, \r\n on Windows) to just \n.
When writing in text mode, the default is to convert occurrences of \n back to platform-specific line endings.
This behind-the-scenes modification to file data is fine for text files, but will corrupt binary data like that in
JPEG or EXE files.
Be very careful to use binary mode when reading and writing such files.
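For example, a minimal sketch of a binary-mode round trip (the filename is arbitrary):

```python
# 'b' in the mode string: data is read and written as bytes objects
with open("blob.bin", "wb") as f:
    f.write(b"\x00\x01\x02PNG")   # bytes, not str; no encoding involved

with open("blob.bin", "rb") as f:
    data = f.read()

print(data)  # b'\x00\x01\x02PNG'
```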
It is good practice to use the with keyword when dealing with file objects.
The advantage is that the file is properly closed after its suite finishes, even if an exception is raised at some point.
Using with is also much shorter than writing equivalent try-finally blocks:
>>> with open("workfile", encoding="utf-8") as f:
...
read_data = f.read()
>>> # We can check that the file has been automatically closed.
>>> f.closed
True
If you're not using the with keyword, then you should call f.close() to close the file and immediately free up any system resources used by it.
Warning: Calling f.write() without using the with keyword or calling f.close() might result in the arguments of f.write() not being completely written to the disk, even if the program exits successfully.
After a file object is closed, either by a with statement or by calling f.close(), attempts to use the file object will automatically fail.
>>> f.close()
>>> f.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file.
7.2.1. Methods of File Objects
The rest of the examples in this section will assume that a file object called f has already been created.
To read a file’s contents, call f.read(size), which reads some quantity of data and returns it as a string (in text mode) or bytes object (in binary mode).
size is an optional numeric argument.
When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory.
Otherwise, at most size characters (in text mode) or size bytes (in binary mode) are read and returned.
If the end of the file has been reached, f.read() will return an empty string ('').
>>> f.read()
'This is the entire file.\n'
>>> f.read()
''
f.readline() reads a single line from the file; a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline.
This makes the return value unambiguous; if f.readline() returns an empty string, the end of the file has been reached, while a blank line is represented by '\n', a string containing only a single newline.
>>> f.readline()
'This is the first line of the file.\n'
>>> f.readline()
'Second line of the file\n'
>>> f.readline()
''
For reading lines from a file, you can loop over the file object.
This is memory efficient, fast, and leads to simple code:
>>> for line in f:
...     print(line, end='')
...
This is the first line of the file.
Second line of the file
If you want to read all the lines of a file in a list you can also use list(f) or f.readlines().
f.write(string) writes the contents of string to the file, returning the number of characters written.
>>> f.write("This is a test\n")
15
Other types of objects need to be converted – either to a string (in text mode) or a bytes object (in binary mode) – before writing them:
>>> value = ("the answer", 42)
>>> s = str(value) # convert the tuple to string
>>> f.write(s)
18
f.tell() returns an integer giving the file object’s current position in the file represented as number of bytes from the beginning of the file when in binary mode and an opaque number when in text mode.
To change the file object’s position, use f.seek(offset, whence).
The position is computed from adding offset to a reference point; the reference point is selected by the whence argument.
A whence value of 0 measures from the beginning of the file, 1 uses the current file position, and 2 uses the end of the file as the reference point.
whence can be omitted and defaults to 0, using the beginning of the file as the reference point.
>>> f = open("workfile", "rb+")
>>> f.write(b"0123456789abcdef")
16
>>> f.seek(5) # Go to the 6th byte in the file
5
>>> f.read(1)
b'5'
>>> f.seek(-3, 2) # Go to the 3rd byte before the end
13
>>> f.read(1)
b'd'
In text files (those opened without a b in the mode string), only seeks relative to the beginning of the file are allowed (the exception being seeking to the very file end with seek(0, 2)), and the only valid offset values are those returned from f.tell(), or zero.
Any other offset value produces undefined behaviour.
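A short sketch of seeking safely in text mode, using only offsets obtained from f.tell() or zero (the filename is arbitrary):

```python
with open("workfile", "w+", encoding="utf-8") as f:
    f.write("first line\n")
    pos = f.tell()             # opaque in text mode, but valid to seek back to
    f.write("second line\n")
    f.seek(0)                  # offset 0 is always allowed
    whole = f.read()
    f.seek(pos)                # jump back to the remembered position
    tail = f.read()

print(tail)  # second line
```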
File objects have some additional methods, such as isatty() and truncate(), which are less frequently used; consult the Library Reference for a complete guide to file objects.
7.2.2. Saving structured data with json
Strings can easily be written to and read from a file.
Numbers take a bit more effort, since the read() method only returns strings, which will have to be passed to a function like int(), which takes a string like '123' and returns its numeric value 123.
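For example (a minimal sketch with an arbitrary filename):

```python
# read() always returns a string; convert it explicitly to get a number back
with open("number.txt", "w", encoding="utf-8") as f:
    f.write("123")

with open("number.txt", encoding="utf-8") as f:
    value = int(f.read())   # '123' -> 123

print(value + 1)  # 124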
When you want to save more complex data types like nested lists and dictionaries, parsing and serializing by hand becomes complicated.
Rather than having users constantly writing and debugging code to save complicated data types to files, Python allows you to use the popular data interchange format called JSON (JavaScript Object Notation).
The standard module called json can take Python data hierarchies, and convert them to string representations; this process is called serializing.
Reconstructing the data from the string representation is called deserializing.
Between serializing and deserializing, the string representing the object may have been stored in a file or data, or sent over a network connection to some distant machine.
Note The JSON format is commonly used by modern applications to allow for data exchange.
Many programmers are already familiar with it, which makes it a good choice for interoperability.
If you have an object x, you can view its JSON string representation with a simple line of code:
>>> import json
>>> x = [1, "simple", "list"]
>>> json.dumps(x)
'[1, "simple", "list"]'
Another variant of the dumps() function, called dump(), simply serializes the object to a text file.
So if f is a text file object opened for writing, we can do this:
json.dump(x, f)
To decode the object again, if f is a binary file or text file object which has been opened for reading:
x = json.load(f)
Note: JSON files must be encoded in UTF-8.
Use encoding="utf-8" when opening a JSON file as a text file for both reading and writing.
This simple serialization technique can handle lists and dictionaries, but serializing arbitrary class instances in JSON requires a bit of extra effort.
The reference for the json module contains an explanation of this.
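Putting the pieces together, a minimal round trip might look like this (the filename is arbitrary):

```python
import json

data = {"name": "Jack", "phones": [4098, 4127], "active": True}

# Serialize the structure to a file...
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f)

# ...and deserialize it back
with open("data.json", "r", encoding="utf-8") as f:
    restored = json.load(f)

print(restored == data)  # True
```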
See also
pickle - the pickle module
Contrary to JSON, pickle is a protocol which allows the serialization of arbitrarily complex Python objects.
As such, it is specific to Python and cannot be used to communicate with applications written in other languages.
It is also insecure by default: deserializing pickle data coming from an untrusted source can execute arbitrary code, if the data was crafted by a skilled attacker.
Python Dictionaries
A dictionary is a collection which is ordered, changeable, and does not allow duplicates.
Dictionaries are used to store data values in key:value pairs.
Dictionaries cannot have two items with the same key.
Assigning a value to an existing key overwrites the existing value.
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
print(thisdict["brand"])
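A quick demonstration of the duplicate-key behavior described above:

```python
# The second "year" entry silently overwrites the first
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964,
    "year": 2020
}
print(thisdict["year"])  # 2020
print(len(thisdict))     # 3
```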
This automation script extracts the HTML from a webpage URL and also provides functions you can use to parse the HTML for data.
It is a great treat for web scrapers and for anyone who wants to parse HTML for important data.
# Parse and Extract HTML
# pip install gazpacho
import gazpacho
# Extract HTML from URL
url = 'https://www.example.com/'
html = gazpacho.get(url)
print(html)
# Extract HTML with Headers
headers = {'User-Agent': 'Mozilla/5.0'}
html = gazpacho.get(url, headers=headers)
print(html)
# Parse HTML
parse = gazpacho.Soup(html)
# Find single tags
tag1 = parse.find('h1')
tag2 = parse.find('span')
# Find multiple tags
tags1 = parse.find_all('p')
tags2 = parse.find_all('a')
# Find tags by class or other attributes
tag = parse.find("div", attrs={"class": "test"})
# Extract text from tags
text = parse.find('h1').text
text = parse.find_all('p')[0].text
Qrcode Scanner
Have a lot of QR images, or just want to scan a QR image? This automation script will help you with that.
It uses the qrtools module, which enables you to scan QR images programmatically.
# Qrcode Scanner
# pip install qrtools
from qrtools import Qr
def Scan_Qr(qr_img):
    qr = Qr()
    qr.decode(qr_img)
    print(qr.data)
    return qr.data
print("Your Qr Code is: ", Scan_Qr("qr.png"))
Take Screenshots
Now you can take screenshots programmatically by using the script below.
With it, you can take a full-screen screenshot or capture a specific area of the screen.
# Grab Screenshot
# pip install pyautogui
# pip install Pillow
from pyautogui import screenshot
import time
from PIL import ImageGrab
# Grab Screenshot of Screen
def grab_screenshot():
    shot = screenshot()
    shot.save('my_screenshot.png')

# Grab Screenshot of Specific Area
def grab_screenshot_area():
    area = (0, 0, 500, 500)
    shot = ImageGrab.grab(area)
    shot.save('my_screenshot_area.png')

# Grab Screenshot with Delay
def grab_screenshot_delay():
    time.sleep(5)
    shot = screenshot()
    shot.save('my_screenshot_delay.png')
Create AudioBooks
Tired of converting your PDF books to audiobooks manually? Here is an automation script that uses the gTTS module to convert your PDF text to audio.
# Create Audiobooks
# pip install gTTS
# pip install PyPDF2
from PyPDF2 import PdfFileReader as reader
from gtts import gTTS
def create_audio(pdf_file):
    read_Pdf = reader(open(pdf_file, 'rb'))
    for page in range(read_Pdf.numPages):
        text = read_Pdf.getPage(page).extractText()
        tts = gTTS(text, lang='en')
        tts.save('page' + str(page) + '.mp3')

create_audio('book.pdf')
PDF Editor
Use the automation script below to edit your PDF files with Python.
It uses the PyPDF4 module, an upgraded version of PyPDF2; below I have coded common functions such as parsing text, removing pages, and more.
It is a handy script when you have a lot of PDFs to edit, or when your Python project needs to manipulate PDFs programmatically.
# PDF Editor
# pip install PyPDF4
import PyPDF4

# Parse the Text from PDF
def parse_text(pdf_file):
    reader = PyPDF4.PdfFileReader(pdf_file)
    for index in range(reader.getNumPages()):
        print(reader.getPage(index).extractText())

# Keep Only the Selected Pages of a PDF
def remove_page(pdf_file, page_numbers):
    reader = PyPDF4.PdfFileReader(pdf_file)
    writer = PyPDF4.PdfFileWriter()
    for index in page_numbers:
        writer.addPage(reader.getPage(index))
    with open('rm.pdf', 'wb') as f:
        writer.write(f)

# Add Blank Page to PDF
def add_page(pdf_file):
    reader = PyPDF4.PdfFileReader(pdf_file)
    writer = PyPDF4.PdfFileWriter()
    for index in range(reader.getNumPages()):
        writer.addPage(reader.getPage(index))
    writer.addBlankPage()
    with open('add.pdf', 'wb') as f:
        writer.write(f)

# Rotate Pages
def rotate_page(pdf_file):
    reader = PyPDF4.PdfFileReader(pdf_file)
    writer = PyPDF4.PdfFileWriter()
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)
        page.rotateClockwise(90)
        writer.addPage(page)
    with open('rotate.pdf', 'wb') as f:
        writer.write(f)

# Merge PDFs
def merge_pdfs(pdf_file1, pdf_file2):
    writer = PyPDF4.PdfFileWriter()
    for pdf_file in (pdf_file1, pdf_file2):
        reader = PyPDF4.PdfFileReader(pdf_file)
        for index in range(reader.getNumPages()):
            writer.addPage(reader.getPage(index))
    with open('merge.pdf', 'wb') as f:
        writer.write(f)
Mini Stackoverflow
As a programmer, I know we need Stack Overflow every day, but you no longer need to search Google for it.
Now you can get solutions directly in your CMD while you continue working on a project.
By using the howdoi module you can get Stack Overflow solutions in your command prompt or terminal.
Below you can find some examples that you can try.
# Automate Stackoverflow
# pip install howdoi
# Get Answers in CMD
# example 1
> howdoi how do i install python3
# example 2
> howdoi selenium Enter keys
# example 3
> howdoi how to install modules
# example 4
> howdoi Parse html with python
# example 5
> howdoi int not iterable error
# example 6
> howdoi how to parse pdf with python
# example 7
> howdoi Sort list in python
# example 8
> howdoi merge two lists in python
# example 9
> howdoi get last element in list python
# example 10
> howdoi fast way to sort list
Automate Mobile Phone
This automation script will help you automate your smartphone using the Android Debug Bridge (ADB) in Python.
Below I show how you can automate common tasks like swipe gestures, calling, sending SMS, and much more.
You can learn more about ADB and explore more exciting ways to automate your phone to make your life easier.
# Automate Mobile Phones
# Requires the adb tool from the Android platform-tools on your PATH
import subprocess

def main_adb(cm):
    p = subprocess.Popen(cm.split(' '), stdout=subprocess.PIPE)
    (output, _) = p.communicate()
    return output.decode('utf-8')

# Swipe
def swipe(x1, y1, x2, y2, duration):
    cmd = 'adb shell input swipe {} {} {} {} {}'.format(x1, y1, x2, y2, duration)
    return main_adb(cmd)

# Tap or Clicking
def tap(x, y):
    cmd = 'adb shell input tap {} {}'.format(x, y)
    return main_adb(cmd)

# Make a Call
def make_call(number):
    cmd = f"adb shell am start -a android.intent.action.CALL -d tel:{number}"
    return main_adb(cmd)

# Send SMS
def send_sms(number, message):
    cmd = 'adb shell am start -a android.intent.action.SENDTO -d sms:{} --es sms_body "{}"'.format(number, message)
    return main_adb(cmd)

# Download File From Mobile to PC
def download_file(file_name):
    cmd = 'adb pull /sdcard/{}'.format(file_name)
    return main_adb(cmd)

# Take a screenshot
def screenshot():
    cmd = 'adb shell screencap -p'
    return main_adb(cmd)

# Power Button (keyevent 26 toggles power)
def power_off():
    cmd = 'adb shell input keyevent 26'
    return main_adb(cmd)
Monitor CPU/GPU Temp
You probably use CPU-Z or other hardware-monitoring software to capture your CPU and GPU temperatures, but you can do that programmatically too.
This script uses Pythonnet and OpenHardwareMonitor to help you monitor your current CPU and GPU temperature.
You can use it to notify yourself when a certain temperature is reached, or use it in your Python project to make your daily life easier.
# Get CPU/GPU Temperature
# pip install pythonnet
import clr
clr.AddReference("OpenHardwareMonitorLib")
from OpenHardwareMonitorLib import *
spec = Computer()
spec.GPUEnabled = True
spec.CPUEnabled = True
spec.Open()
# Get CPU Temp
def Cpu_Temp():
    while True:
        for sensor in spec.Hardware[0].Sensors:
            if "/temperature" in str(sensor.Identifier):
                print(str(sensor.Value))

# Get GPU Temp (the Hardware index depends on which devices are enabled)
def Gpu_Temp():
    while True:
        for sensor in spec.Hardware[1].Sensors:
            if "/temperature" in str(sensor.Identifier):
                print(str(sensor.Value))
Instagram Uploader Bot
Instagram is a famous social media platform, and you no longer need to upload your photos or videos through your smartphone.
You can do it programmatically by using the script below.
# Upload Photos and Video on Insta
# pip install instabot
from instabot import Bot
def Upload_Photo(img):
    robot = Bot()
    robot.login(username="user", password="pass")
    robot.upload_photo(img, caption="Medium Article")
    print("Photo Uploaded")

def Upload_Video(video):
    robot = Bot()
    robot.login(username="user", password="pass")
    robot.upload_video(video, caption="Medium Article")
    print("Video Uploaded")

def Upload_Story(img):
    robot = Bot()
    robot.login(username="user", password="pass")
    robot.upload_story(img, caption="Medium Article")
    print("Story Photos Uploaded")

Upload_Photo("img.jpg")
Upload_Video("video.mp4")
Video Watermarker
Add a watermark to your videos using this automation script, which uses MoviePy, a handy module for video editing.
The script below shows how you can add a watermark; feel free to use it.
# Video Watermark with Python
# pip install moviepy
from moviepy.editor import *
clip = VideoFileClip("myvideo.mp4", audio=True)
width, height = clip.size
text = TextClip("WaterMark", font='Arial', color='white', fontsize=28)
set_color = text.on_color(size=(clip.w + text.w, text.h - 10), color=(0, 0, 0), pos=(6, 'center'), col_opacity=0.6)
set_textPos = set_color.set_pos(lambda t: (max(width / 30, int(width - 0.5 * width * t)), max(5 * height / 6, int(100 * t))))
Output = CompositeVideoClip([clip, set_textPos])
Output.duration = clip.duration
Output.write_videofile("output.mp4", fps=30, codec='libx264')
Reading a macro inside an Excel file using Python
Use the xlwings library.
xlwings allows you to interact with Excel files and run the macros they contain.
Install the xlwings library by running the following command in your Python environment:
pip install xlwings
import xlwings as xw
Use the xw.Book() function to open the Excel file containing the macro (note that macro-enabled workbooks use the .xlsm extension):
wb = xw.Book('path_to_your_excel_file.xlsm')
Access a macro within the Excel file using the macro() method of the Book object:
my_macro = wb.macro('macro_name')
Replace 'macro_name' with the name of the macro you want to access.
The returned object is callable, so you can run the macro like a function:
my_macro()
Note that this gives you a runnable handle, not the macro's source code; to read the VBA source you need the VBA object model via wb.api (Windows only, with "Trust access to the VBA project object model" enabled in Excel).
Close the workbook after you have finished reading the macro:
wb.close()
To list all macro modules
xlwings does not expose a list of macro names directly; on Windows you can go through the VBA object model of the underlying workbook object:
for component in wb.api.VBProject.VBComponents:
    print(component.Name)
List comprehension offers a shorter syntax when you want to create a new list based on the values of an existing list.
Example:
Based on a list of fruits, you want a new list, containing only the fruits with the letter "a" in the name.
Without list comprehension you will have to write a for statement with a conditional test inside:
Example:
fruits = ["apple", "banana", "cherry", "kiwi", "mango"]
newlist = []
for x in fruits:
    if "a" in x:
        newlist.append(x)
print(newlist)
With list comprehension you can do all that with only one line of code:
Example:
fruits = ["apple", "banana", "cherry", "kiwi", "mango"]
newlist = [x for x in fruits if "a" in x]
print(newlist)
The Syntax
newlist = [expression for item in iterable if condition == True]
The return value is a new list, leaving the old list unchanged.
Condition
The condition is like a filter that only accepts the items that evaluate to True.
Example: Only accept items that are not "apple":
newlist = [x for x in fruits if x != "apple"]
The condition if x != "apple" will return True for all elements other than "apple", making the new list contain all fruits except "apple".
The condition is optional and can be omitted:
Example: With no if statement:
newlist = [x for x in fruits]
Iterable
The iterable can be any iterable object, like a list, tuple, set etc.
Example: You can use the range() function to create an iterable:
newlist = [x for x in range(10)]
Same example, but with a condition:
Example: Accept only numbers lower than 5:
newlist = [x for x in range(10) if x < 5]
Expression
The expression is the current item in the iteration, but it is also the outcome, which you can manipulate before it ends up as an item in the new list:
Example: Set the values in the new list to upper case:
newlist = [x.upper() for x in fruits]
You can set the outcome to whatever you like:
Example: Set all values in the new list to 'hello':
newlist = ['hello' for x in fruits]
The expression can also contain conditions, not like a filter, but as a way to manipulate the outcome:
Example: Return "orange" instead of "banana":
newlist = [x if x != "banana" else "orange" for x in fruits]
The expression in the example above says:
"Return the item if it is not banana; if it is banana, return orange."
1. For loops
A for loop is a multi-line statement, but in Python we can write a for loop in one line using list comprehension.
Let's take filtering out values less than 250 as an example.
The sample code is as follows:
# For loop in one line
mylist = [100, 200, 300, 400, 500]

# Original way
result = []
for x in mylist:
    if x > 250:
        result.append(x)
print(result) # [300, 400, 500]

# One-line way
result = [x for x in mylist if x > 250]
print(result) # [300, 400, 500]

2. While loops
This one-liner snippet shows how to write a while loop in a single line; here I have shown two methods.
The code is as follows:
# method 1: single statement
while True: print(1) # prints 1 forever

# method 2: multiple statements
x = 0
while x < 5: print(x); x = x + 1 # 0 1 2 3 4

3. If-else statements
To write an if-else statement in one line, we use the ternary operator.
The ternary syntax is "[on true] if [expression] else [on false]".
I show 3 examples in the sample code below to make clear how the ternary operator is used for one-line if-else statements; to get elif behavior, we have to chain multiple ternary operators.
# if-else in one line
# Example 1: if else
print("Yes") if 8 > 9 else print("No") # No

# Example 2: if elif else
E = 2
print("High") if E == 5 else print("Medium") if E == 2 else print("Low") # Medium

# Example 3: only if
if 3 > 2: print("Exactly") # Exactly

4. Merging dictionaries
This one-liner shows how to merge two dictionaries into one with a single line of code.
Below I show two ways of merging dictionaries.
# Merge dictionaries in one line
d1 = { 'A': 1, 'B': 2 }
d2 = { 'C': 3, 'D': 4 }

# method 1
d1.update(d2)
print(d1) # {'A': 1, 'B': 2, 'C': 3, 'D': 4}

# method 2
d3 = {**d1, **d2}
print(d3) # {'A': 1, 'B': 2, 'C': 3, 'D': 4}

5. Writing functions
There are two ways to write a function in one line. In the first method, we use a normal function definition together with a ternary operator or a one-line loop.
The second method defines the function with lambda; check out the sample code below for a clearer understanding.
# Functions in one line
# method 1
def fun(x): return True if x % 2 == 0 else False
print(fun(2)) # True

# method 2
fun = lambda x: x % 2 == 0
print(fun(2)) # True
print(fun(3)) # False

6. One-line recursion
This snippet shows how to use recursion in a single line by combining a one-line function definition with a one-line if-else statement; below is an example that finds Fibonacci numbers.
# Recursion in one line
# Fibonacci example with one-line recursion
def Fib(x): return 1 if x in {0, 1} else Fib(x-1) + Fib(x-2)
print(Fib(5)) # 8
print(Fib(15)) # 987

7. Array filtering
A Python list can be filtered in one line of code using list comprehension; let's take filtering a list for even numbers as an example.
# Array filtering in one line
mylist = [2, 3, 5, 8, 9, 12, 13, 15]

# Normal way
result = []
for x in mylist:
    if x % 2 == 0:
        result.append(x)
print(result) # [2, 8, 12]

# One-line way
result = [x for x in mylist if x % 2 == 0]
print(result) # [2, 8, 12]

8. Exception handling
We use exception handling to deal with runtime errors in Python. Did you know you can write a try-except statement in one line? Using the exec() function, we can do exactly that.
# Exception handling in one line
# Original way
try:
    print(x)
except:
    print("Error")

# One-line way
exec('try:print(x) \nexcept:print("Error")') # Error

9. List to dictionary
We can convert a list to a dictionary in one line using the enumerate() function: pass the list to enumerate() and use dict() to convert the final output to dictionary format.
# List to dictionary in one line
mylist = ["John", "Peter", "Mathew", "Tom"]
mydict = dict(enumerate(mylist))
print(mydict) # {0: 'John', 1: 'Peter', 2: 'Mathew', 3: 'Tom'}

10. Multiple variable assignment
Python allows multiple variables to be assigned on one line; the sample code below shows you how.
# Multiple variable assignment
# Normal way
x = 5
y = 7
z = 10
print(x, y, z) # 5 7 10

# One-line way
a, b, c = 5, 7, 10
print(a, b, c) # 5 7 10

11. Swapping
Swapping is an interesting task in programming, and it normally requires a third variable, temp, to hold one of the values.
This one-liner shows how to swap two values in a single line without any temporary variable.
# Swap in one line
# Normal way
v1 = 100
v2 = 200
temp = v1
v1 = v2
v2 = temp
print(v1, v2) # 200 100

# One-line swap
v1, v2 = 100, 200
v1, v2 = v2, v1
print(v1, v2) # 200 100

12. Sorting
Sorting is a common problem in programming, and Python has many built-in methods to solve it; the code example below shows how to sort in one line.
# Sort in one line
mylist = [32, 22, 11, 4, 6, 8, 12]

# method 1
mylist.sort()
print(mylist) # [4, 6, 8, 11, 12, 22, 32]

# method 2
print(sorted(mylist)) # [4, 6, 8, 11, 12, 22, 32]

13. Reading files
A file can also be read in one line, without a with statement or the normal read methods.
# Read a file in one line
# Normal way
with open("data.txt", "r") as file:
    data = file.readline()
print(data) # Hello world

# One-line way
data = [line.strip() for line in open("data.txt", "r")]
print(data) # ['hello world', 'Hello Python']

14. Classes
Classes always take multiple lines, but in Python there are ways to use class-like features in one line of code.
# Class in one line
# Normal way
class Emp:
    def __init__(self, name, age):
        self.name = name
        self.age = age

emp1 = Emp("Haider", 22)
print(emp1.name, emp1.age) # Haider 22

# One-line way
# method 1: lambda with dynamic attributes
Emp = lambda: None; Emp.name = "Haider"; Emp.age = 22
print(Emp.name, Emp.age) # Haider 22

# method 2
from collections import namedtuple
Emp = namedtuple('Emp', ["name", "age"])("Haider", 22)
print(Emp.name, Emp.age) # Haider 22

15. Semicolons
This one-liner shows how you can use semicolons to write multiple statements on a single line.
# Semicolons in one line
# example 1
a = "Python"; b = "Programming"; c = "Language"; print(a, b, c)
# output:
# Python Programming Language

16. Printing
This is not a very important snippet, but it is sometimes useful when you do not need a loop to perform a task.
# Print in one line
# Normal way
for x in range(1, 5):
    print(x) # 1 2 3 4

# One-line way
print(*range(1, 5)) # 1 2 3 4
print(*range(1, 6)) # 1 2 3 4 5

17. The map function
map is a higher-order function that applies a given function to every element of an iterable; below is an example of how we can use map in one line of code.
# map in one line
print(list(map(lambda a: a + 2, [5, 6, 7, 8, 9, 10])))
# output
# [7, 8, 9, 10, 11, 12]

18. Deleting multiple list elements
You can delete multiple elements from a list in one line of code using the del statement with a slice.
# Delete multiple elements in one line
mylist = [100, 200, 300, 400, 500]
del mylist[1::2]
print(mylist) # [100, 300, 500]

19. Printing patterns
You no longer need a loop to print a repeated pattern; you can do the same in one line of code using a print statement and the asterisk (*) operator.
# Print a pattern in one line
# Normal way
for x in range(3):
    print('😀')
# output
# 😀 😀 😀

# One-line way
print('😀' * 3) # 😀😀😀
print('😀' * 2) # 😀😀
print('😀' * 1) # 😀

20. Finding prime numbers
This snippet shows how to write a single line of code that finds the prime numbers in a range.
# Find Prime Number
print(list(filter(lambda a: all(a % b != 0 for b in range(2, a)), range(2,20))))
#Output
# [2, 3, 5, 7, 11, 13, 17, 19]
student info management
# Student information is stored in dictionaries
student_info = [
    {'name': 'Jingqi', 'chinese': 60, 'math': 60, 'english': 60, 'total': 180},
    {'name': 'Siyue', 'chinese': 60, 'math': 60, 'english': 60, 'total': 180},
    {'name': 'Luoluo', 'chinese': 60, 'math': 60, 'english': 60, 'total': 180},
]
# Menu shown on every loop iteration
msg = '1. Add  2. Show all  3. Look up  4. Delete  5. Modify'
# Infinite loop: while True
while True:
    print(msg)
    num = input('Enter the operation you want to perform: ')
    # Check what was entered, then respond accordingly
    if num == '1':
        name = input('Enter the student name: ')
        chinese = int(input('Enter the Chinese score: '))
        math = int(input('Enter the math score: '))
        english = int(input('Enter the English score: '))
        score = chinese + math + english  # total score
        student_dit = {  # put the information into a dictionary
            'name': name,
            'chinese': chinese,
            'math': math,
            'english': english,
            'total': score,
        }
        student_info.append(student_dit)  # add the student to the list
    elif num == '2':
        print('Name\t\tChinese\t\tMath\t\tEnglish\t\tTotal')
        for student in student_info:
            print(
                student['name'], '\t\t',
                student['chinese'], '\t\t',
                student['math'], '\t\t',
                student['english'], '\t\t',
                student['total'],
            )
    elif num == '3':
        name = input('Enter the name of the student to look up: ')
        for student in student_info:
            if name == student['name']:  # check whether the names match
                print('Name\t\tChinese\t\tMath\t\tEnglish\t\tTotal')
                print(
                    student['name'], '\t\t',
                    student['chinese'], '\t\t',
                    student['math'], '\t\t',
                    student['english'], '\t\t',
                    student['total'],
                )
                break
        else:
            print('No student named {} found!'.format(name))
    elif num == '4':
        name = input('Enter the name of the student to delete: ')
        for student in student_info:
            if name == student['name']:
                print('Name\t\tChinese\t\tMath\t\tEnglish\t\tTotal')
                print(
                    student['name'], '\t\t',
                    student['chinese'], '\t\t',
                    student['math'], '\t\t',
                    student['english'], '\t\t',
                    student['total'],
                )
                choose = input(f'Are you sure you want to delete {name}? (y/n) ')
                if choose == 'y' or choose == 'Y':
                    student_info.remove(student)
                    print(f'{name} has been deleted!')
                    break
                elif choose == 'n' or choose == 'N':
                    break
        else:
            print('No student named {} found!'.format(name))
    elif num == '5':
        print('Modify student information')
        name = input('Enter the name of the student to modify: ')
        for student in student_info:
            if name == student['name']:
                print('Name\t\tChinese\t\tMath\t\tEnglish\t\tTotal')
                print(
                    student['name'], '\t\t',
                    student['chinese'], '\t\t',
                    student['math'], '\t\t',
                    student['english'], '\t\t',
                    student['total'],
                )
                choose = input(f'Modify the information for {name}? (y/n) ')
                if choose == 'y' or choose == 'Y':
                    name = input('Enter the student name: ')
                    chinese = int(input('Enter the Chinese score: '))
                    math = int(input('Enter the math score: '))
                    english = int(input('Enter the English score: '))
                    score = chinese + math + english  # total score
                    student['name'] = name
                    student['chinese'] = chinese
                    student['math'] = math
                    student['english'] = english
                    student['total'] = score
                    print(f'{name} has been updated!')
                    break
                elif choose == 'n' or choose == 'N':
                    # break out of the loop
                    break
        else:
            print('No student named {} found!'.format(name))
A syntax upgrade for if else
From if else to match case: Python is a language that leans heavily on if else.
Python has truly pushed if else to its limits. It has no ternary operator (xx ? y : z), for example, and no matter: it can build one out of if else.
x = True if 100 > 0 else False
And the oddities do not end there: if and else can each team up with other statements, and of the two, else plays the wilder game.
a: else can pair with try: when the try block raises no exception, the else block runs.
#!/usr/bin/env python3
# -*- coding: utf8 -*-

def main():
    try:
        # ...
        pass
    except Exception as err:
        pass
    else:
        print("this is else block")
    finally:
        print("finally block")

if __name__ == "__main__":
    main()

b: else can also work with loops: when the loop body never executes a break, the else block runs.
#!/usr/bin/env python3
# -*- coding: utf8 -*-

def main():
    for i in range(3):
        pass
    else:
        print("this is else block")

    while False:
        pass
    else:
        print("this is else block")

if __name__ == "__main__":
    main()

c: if has far fewer side jobs than else; the most common one is in list comprehensions.
Take filtering the even numbers out of a list. Traditionally the code might look like this:
#!/usr/bin/env python3
# -*- coding: utf8 -*-

def main():
    result = []
    numbers = [1, 2, 3, 4, 5]
    for number in numbers:
        if number % 2 == 0:
            result.append(number)
    print(result)

if __name__ == "__main__":
    main()

A list comprehension does it in one line.

#!/usr/bin/env python3
# -*- coding: utf8 -*-

def main():
    numbers = [1, 2, 3, 4, 5]
    print([n for n in numbers if n % 2 == 0])

if __name__ == "__main__":
    main()

These enhancements all look reasonable, but for switch-like scenarios they fall short.
No switch statement? if else steps up
For a language that pushes if else this hard in its syntax, having no switch statement was never a problem: it can use if else!
#!/usr/bin/env python3
# -*- coding: utf8 -*-

def fun(times):
    """Not the focus of the test, so the body is left blank.

    Parameters
    ----------
    times: int
    """
    pass

def main(case_id: int):
    """Routing from case_id to the function call involves more logic in
    practice; for simplicity it is reduced here to passing 100 * case_id.

    Parameters
    ----------
    case_id: int
    """
    if case_id == 1:
        fun(100 * 1)
    elif case_id == 2:
        fun(100 * 2)
    elif case_id == 3:
        fun(100 * 3)
    elif case_id == 4:
        fun(100 * 4)

if __name__ == "__main__":
    main(1)

Written out like this, the code reads like a laundry list, which is anything but elegant; in Python terms, it is not Pythonic at all. Other languages might shrug, but in Python, being inelegant is practically a crime.
With all that build-up, we are finally getting to the point.
The community came up with a tidier pattern, and the new form drops if else entirely.
#!/usr/bin/env python3
# -*- coding: utf8 -*-

def fun(times):
    pass

# A dict keyed by case id, with the function object to run as the value,
# routes each case to its handler.
routers = {
    1: fun,
    2: fun,
    3: fun,
    4: fun,
}

def main(case_id: int):
    routers[case_id](100 * case_id)

if __name__ == "__main__":
    main(1)

The new form is noticeably more concise. You could also say the community completed an evolution here: from clinging to if else as a family heirloom to not using it at all.
Which is rather amusing.
The new form is not without problems, though. Performance: its performance simply does not hold up.
Benchmarking if else against the dict pattern
Before the results, a word on my environment: a Tencent Cloud virtual machine running Python-3.12.0a3.
The test code records elapsed time and memory overhead; the lower the elapsed time, the better the performance.
The full code follows.
#!/usr/bin/env python3
# -*- coding: utf8 -*-

import timeit
import tracemalloc

tracemalloc.start()

def fun(times):
    """Not the focus of the test, so the body is left blank.

    Parameters
    ----------
    times: int
    """
    pass

# Route each case id to its handler
routers = {
    1: fun,
    2: fun,
    3: fun,
    4: fun,
}

def main_if_else(case_id: int):
    """Time the if else form.

    Parameters
    ----------
    case_id: int
        Unique id of the case.
    """
    if case_id == 1:
        fun(100 * 1)
    elif case_id == 2:
        fun(100 * 2)
    elif case_id == 3:
        fun(100 * 3)
    elif case_id == 4:
        fun(100 * 4)

def main_dict(case_id: int):
    """Time the dict-routing form.

    Parameters
    ----------
    case_id: int
        Unique id of the case.
    """
    routers[case_id](100 * case_id)

if __name__ == "__main__":
    # 1. record the starting time and memory
    # 2. run the benchmark (point `main` at the variant under test)
    # 3. record the end time and the total cost
    main = main_dict
    start_current, start_peak = tracemalloc.get_traced_memory()
    start_at = timeit.default_timer()
    for i in range(10000000):
        main((i % 4) + 1)
    cost = timeit.default_timer() - start_at
    end_current, end_peak = tracemalloc.get_traced_memory()
    print(f"time cost = {cost} .")
    print(f"memory cost = {end_current - start_current}, {end_peak - start_peak}")

The results from my development environment follow.
Text version.
The dict form may be more elegant, but on performance it does not hold up.
With that, the protagonist of this story takes the stage.
The match case syntax. Python 3.10 introduced a new construct, match case, which works much like switch case in other languages.
It performs a little better than the dict form and reads a little better than if else.
The syntax looks roughly like this.
match xxx:
case aaa:
...
case bbb:
...
case ccc:
...
case ddd:
...
Talk is cheap! Let's adapt the benchmark and compare the performance of all three.
#!/usr/bin/env python3
# -*- coding: utf8 -*-

import timeit
import tracemalloc

tracemalloc.start()

def fun(times):
    """Not the focus of the test, so the body is left blank.

    Parameters
    ----------
    times: int
    """
    pass

# Route each case id to its handler
routers = {
    1: fun,
    2: fun,
    3: fun,
    4: fun,
}

def main_if_else(case_id: int):
    """Time the if else form.

    Parameters
    ----------
    case_id: int
        Unique id of the case.
    """
    if case_id == 1:
        fun(100 * 1)
    elif case_id == 2:
        fun(100 * 2)
    elif case_id == 3:
        fun(100 * 3)
    elif case_id == 4:
        fun(100 * 4)

def main_dict(case_id: int):
    """Time the dict-routing form.

    Parameters
    ----------
    case_id: int
        Unique id of the case.
    """
    routers[case_id](100 * case_id)

def main_match(case_id: int):
    """Time the match case form.

    Parameters
    ----------
    case_id: int
        Unique id of the case.
    """
    match case_id:
        case 1:
            fun(100 * 1)
        case 2:
            fun(100 * 2)
        case 3:
            fun(100 * 3)
        case 4:
            fun(100 * 4)

if __name__ == "__main__":
    # 1. record the starting time and memory
    # 2. run the benchmark (point `main` at the variant under test)
    # 3. record the end time and the total cost
    main = main_match
    start_current, start_peak = tracemalloc.get_traced_memory()
    start_at = timeit.default_timer()
    for i in range(10000000):
        main((i % 4) + 1)
    cost = timeit.default_timer() - start_at
    end_current, end_peak = tracemalloc.get_traced_memory()
    print(f"time cost = {cost} .")
    print(f"memory cost = {end_current - start_current}, {end_peak - start_peak}")
match case turns in quite reasonable timings.
The detailed numbers follow.
20 Python libraries
Requests.
The most famous http library written by Kenneth Reitz.
It's a must have for every python developer.
Scrapy.
If you are involved in webscraping then this is a must have library for you.
After using this library you won't use any other.
wxPython.
A gui toolkit for python.
I have primarily used it in place of tkinter.
You will really love it.
Pillow.
A friendly fork of PIL (Python Imaging Library).
It is more user friendly than PIL and is a must have for anyone who works with images.
SQLAlchemy.
A database library.
Many love it and many hate it.
The choice is yours.
BeautifulSoup.
I know it's slow but this xml and html parsing library is very useful for beginners.
Twisted.
The most important tool for any network application developer.
It has a very beautiful api and is used by a lot of famous python developers.
NumPy.
How can we leave out this very important library? It provides some advanced math functionality to python.
SciPy.
When we talk about NumPy then we have to talk about scipy.
It is a library of algorithms and mathematical tools for python and has caused many scientists to switch from ruby to python.
matplotlib.
A numerical plotting library.
It is very useful for any data scientist or any data analyzer.
Pygame.
Which developer does not like to play games and develop them? This library will help you achieve your goal of 2d game development.
Pyglet.
A 3d animation and game creation engine.
This is the engine in which the famous python port of Minecraft was made.
pyQT.
A GUI toolkit for python.
It is my second choice after wxpython for developing GUI's for my python scripts.
pyGtk.
Another python GUI library.
It is the same library in which the famous Bittorrent client is created.
Scapy.
A packet sniffer and analyzer for python made in python.
pywin32.
A python library which provides some useful methods and classes for interacting with windows.
nltk.
Natural Language Toolkit – I realize most people won’t be using this one, but it’s generic enough.
It is a very useful library if you want to manipulate strings.
But its capabilities go well beyond that.
Do check it out.
nose.
A testing framework for python.
It is used by millions of python developers.
It is a must have if you do test driven development.
SymPy.
SymPy can do algebraic evaluation, differentiation, expansion, complex numbers, etc.
It is contained in a pure Python distribution.
IPython.
I just can’t stress enough how useful this tool is.
It is a python prompt on steroids.
It has completion, history, shell capabilities, and a lot more.
Make sure that you take a look at it.
Network connectivity
import platform, os, traceback

# Detect the OS: ping's repeat-count flag is -n on Windows and -c elsewhere
def get_os():
    os_name = platform.system()
    if os_name == "Windows":
        return "n"
    else:
        return "c"
# Ping check: returns OK on success, otherwise Down
def ping_ip2(ip_str):
    try:
        cmd = ["ping", "-{op}".format(op=get_os()), "1", ip_str]
        # print(cmd)
        output = os.popen(" ".join(cmd)).readlines()
        # print(output)
        flag = False
        for line in list(output):
            if not line:
                continue
            if str(line).upper().find("TTL") >= 0:
                flag = True
                break
        if flag:
            # print("%s OK\n" % (ip_str))
            return "OK"
        else:
            # print("%s Down\n" % (ip_str))
            return "Down"
    except Exception:
        print(traceback.format_exc())
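A variant of the helper above can lean on subprocess instead of os.popen, using the exit code and a hard timeout rather than scanning the output for "TTL". This is a sketch of mine; the names build_ping_cmd and ping_host are not from the original script:

```python
import platform
import subprocess

def build_ping_cmd(host, count=1):
    """Build a ping argument list for the current OS (-n on Windows, -c elsewhere)."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", flag, str(count), host]

def ping_host(host, timeout=5):
    """Return True when ping exits 0; no shell and no output parsing needed."""
    try:
        result = subprocess.run(build_ping_cmd(host),
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
```

Passing a list to subprocess.run avoids shell quoting issues, and the timeout keeps a dead host from blocking the whole scan.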
Port status check
# Given an IP and a port (22 by default), report whether the port is open.
import sys, os, socket

def telnet_port_fun2(ip, port=22):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    res = s.connect_ex((ip, port))
    s.close()
    if res == 0:
        return 'OPEN'
    else:
        return 'CLOSE'
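One practical refinement of telnet_port_fun2: connect_ex can block for a long time against a silent host, so a timeout is worth the extra line. A sketch (check_port is my name for it):

```python
import socket

def check_port(ip, port, timeout=3.0):
    """Like telnet_port_fun2, but a dead host fails fast instead of hanging."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising
        return 'OPEN' if s.connect_ex((ip, port)) == 0 else 'CLOSE'
```

The with block also guarantees the socket is closed even if connect_ex raises.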
Upload and download speed
from speedtest import Speedtest

def testing_speed(net):
    download = net.download()
    upload = net.upload()
    print(f'Download speed: {download / (1024 * 1024)} Mbps')
    print(f'Upload speed: {upload / (1024 * 1024)} Mbps')

print("Starting the speed test ...")
# Run it
net = Speedtest()
testing_speed(net)
Interacting with remote hosts via paramiko
# paramiko is one of the key modules behind ansible. It supports SSH2 secure
# connections with password or key authentication, and can run remote commands,
# transfer files, and act as an intermediate SSH proxy.
import paramiko

cmd = "ls"
task_info = "ps -aux"
# Create the client object
ssh = paramiko.SSHClient()
# Accept and record unknown host keys; RejectPolicy() would refuse them instead
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
# hostname: target address, port: port number, username/password: login credentials
ssh.connect(hostname="hostname", username="root", password="password", port=22)
# Run the command; timeout bounds this session; returns the (stdin, stdout, stderr) triple
stdin, stdout, stderr = ssh.exec_command(cmd, timeout=20)
# Decode the returned bytes into a normal string
print(stdout.read().decode())
Linux ssh login check
import subprocess

def scan_port(ip, user, passwd):
    cmd = "id"
    COMMAND = ("timeout 10 sshpass -p '{PASSWD}' ssh -o StrictHostKeyChecking=no "
               "{USER}@{IP} '{CMD}' ").format(PASSWD=passwd, USER=user, IP=ip, CMD=cmd)
    output = subprocess.Popen(COMMAND, shell=True, stderr=subprocess.PIPE,
                              stdout=subprocess.PIPE)
    oerr = output.stderr.readlines()
    oout = output.stdout.readlines()
    oinfo = oerr + oout
    if len(oinfo) != 0:
        oinfo = oinfo[0].decode()
    else:
        oinfo = 'Unknown error.'
    if user in oinfo:
        res = "{USER} login OK".format(USER=user)
    elif "reset" in oinfo:
        res = "Not on the whitelist"
    elif "Permission" in oinfo:
        res = "Wrong password for {USER}".format(USER=user)
    elif 'No route to host' in oinfo or ' port 22: Connection refused' in oinfo:
        res = 'Port 22 unreachable'
    else:
        res = oinfo
    # print(res, '============', oinfo)
    return res
Memory usage
import psutil

def mem_use():
    print('Memory info:')
    mem = psutil.virtual_memory()
    # convert to MB
    memtotal = mem.total / 1024 / 1024
    memused = mem.used / 1024 / 1024
    mem_percent = str(mem.used / mem.total * 100) + '%'
    print('%.3fMB' % memused)
    print('%.3fMB' % memtotal)
    print(mem_percent)
CPU usage
import psutil
import os

def get_cpu_mem():
    pid = os.getpid()
    p = psutil.Process(pid)
    cpu_percent = p.cpu_percent()
    mem_percent = p.memory_percent()
    print("cpu:{:.2f}%,mem:{:.2f}%".format(cpu_percent, mem_percent))
Top ten client IPs in the nginx access log
import matplotlib.pyplot as plt

nginx_file = 'file_path'
ip = {}
# Pull the client IPs out of the nginx log file.
with open(nginx_file) as f:
    for i in f.readlines():
        s = i.strip().split()[0]
        if s in ip.keys():
            ip[s] = ip[s] + 1
        else:
            ip[s] = 1
ip = sorted(ip.items(), key=lambda e: e[1], reverse=True)
# Keep the top ten:
newip = ip[0:10]
tu = dict(newip)
x = []
y = []
for k in tu:
    x.append(k)
    y.append(tu[k])
plt.title('ip access')
plt.xlabel('ip address')
plt.ylabel('pv')
# Rotation angle of the x-axis labels:
plt.xticks(rotation=70)
# Label each bar with its value
for a, b in zip(x, y):
    plt.text(a, b, '%.0f' % b, ha='center', va='bottom', fontsize=6)
plt.bar(x, y)
plt.show()
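The hand-rolled counting dict above can also be expressed with collections.Counter, which sorts for you via most_common. A small testable sketch, assuming the common log format where the client IP is the first whitespace-separated field of each line:

```python
from collections import Counter

def top_ips(lines, n=10):
    """Count the first field (the client IP) of each non-empty log line."""
    counts = Counter(line.split()[0] for line in lines if line.strip())
    # most_common(n) returns (ip, count) pairs, highest count first
    return counts.most_common(n)
```

Feeding it an open file object works too, since iterating a file yields lines.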
Working with MySQL
Method 1: running statements
import pymysql

# Create the connection
conn = pymysql.connect(host="127.0.0.1", port=3306, user='user',
                       passwd='passwd', db='db_name', charset='utf8mb4')
# Create a cursor
cursor = conn.cursor()
# SQL-injection prone (do not build SQL by formatting strings)
sql = "insert into USER (NAME) values('%s')" % ('zhangsan',)
effect_row = cursor.execute(sql)
# Correct way one:
# execute accepts a tuple/list of parameters for a single row
sql = "insert into USER (NAME) values(%s)"
effect_row1 = cursor.execute(sql, ['value1'])
effect_row2 = cursor.execute(sql, ('value2',))
# Correct way two: named parameters
sql = "insert into USER (NAME) values(%(name)s)"
effect_row1 = cursor.execute(sql, {'name': 'value3'})
# Insert several rows at once
effect_row2 = cursor.executemany("insert into USER (NAME) values(%s)",
                                 [('value4',), ('value5',)])
# Commit
conn.commit()
# Close the cursor
cursor.close()
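The placeholder discipline shown above is the same across DB-API drivers; only the placeholder token differs (pymysql uses %s, sqlite3 uses ?). The sketch below demonstrates the idea with the stdlib sqlite3 module so it runs without a MySQL server; the table and function names are mine:

```python
import sqlite3

def insert_users(names):
    """Insert rows with bound parameters, never string formatting, and return all names."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE user (name TEXT)")
    # executemany binds each tuple safely, mirroring pymysql's cursor.executemany
    cur.executemany("INSERT INTO user (name) VALUES (?)", [(n,) for n in names])
    conn.commit()
    cur.execute("SELECT name FROM user ORDER BY name")
    rows = [r[0] for r in cur.fetchall()]
    conn.close()
    return rows
```

Because the driver quotes the bound values itself, a name like "x'); DROP TABLE user;--" is stored as plain data instead of being executed.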
Method 2: a small CRUD helper class
# coding=utf-8
import pymysql
from loguru import logger
from urllib import parse
from dbutils.pooled_db import PooledDB
from sqlalchemy import create_engine

class SqlHelper2(object):
    global host, user, passwd, port
    host = 'ip'
    user = 'root'
    passwd = 'passwd'
    port = 3306

    def __init__(self, db_name):
        self.connect(db_name)

    def connect(self, db_name):
        self.conn = pymysql.connect(host=host, user=user, passwd=passwd,
                                    port=port, db=db_name, charset='utf8mb4')
        self.conn.ping(reconnect=True)
        # self.cursor = self.conn.cursor(cursor=pymysql.cursors.DictCursor)
        self.cursor = self.conn.cursor()

    def get_list(self, sql):
        try:
            self.conn.ping(reconnect=True)  # reconnect to dodge idle timeouts
            self.cursor.execute(sql)
            result = self.cursor.fetchall()
            self.cursor.close()
        except Exception:
            self.conn.ping(reconnect=True)
            self.cursor = self.conn.cursor()
            self.cursor.execute(sql)
            result = self.cursor.fetchall()
        return result

    def get_one(self, sql):
        self.cursor.execute(sql)
        result = self.cursor.fetchone()
        return result

    # Commit data
    def modify(self, sql, args=[]):
        try:
            self.cursor.execute(sql, args)
            self.conn.commit()
            qk = "MySQL write succeeded"
        except Exception as e:
            # Roll back on error
            qk = "MySQL write failed: " + str(e)
            self.conn.rollback()
        return qk

    def multiple(self, sql, args=[]):
        # executemany inserts several rows at once, e.g.
        # self.cursor.executemany('insert into class(id,name) values(%s,%s)',
        #                         [(1, 'wang'), (2, 'li')])
        try:
            self.cursor.executemany(sql, args)
            self.conn.commit()
            qk = "MySQL write succeeded"
        except Exception as e:
            qk = "MySQL write failed: " + str(e)
            self.conn.rollback()
        return qk

    def create(self, sql, args=[]):
        self.cursor.execute(sql, args)
        self.conn.commit()
        return self.cursor.lastrowid

    def close(self):
        self.cursor.close()
        self.conn.close()
xonsh: mixing Python and shell
# Xonsh is a shell built for Linux users who love Python.
# Xonsh is a cross-platform shell language and command prompt written in Python.
# It combines Python and the Bash shell, so you can run Python statements
# directly in the shell, and even mix Python and shell commands.
# pip install xonsh
xonsh  # start it
# shell side
>>> $GOAL = 'Become the Lord of the Files'
>>> print($GOAL)
Become the Lord of the Files
>>> del $GOAL
# python side
d = {'xonsh': True}
d.get('bash', False)
>>> False
Plotting CPU and memory usage
# pyecharts is the Python package for Baidu's open-source echarts library
# and can draw all kinds of charts.
# Line chart
from pyecharts.charts import Line
import pandas as pd
from pyecharts import options as opts
import random

# Mock data: generate a line chart of CPU usage
x = list(pd.date_range('20220701', '20220830'))
y = [random.randint(10, 30) for i in range(len(x))]
z = [random.randint(5, 20) for i in range(len(x))]
line = Line(init_opts=opts.InitOpts(width='800px', height='600px'))
line.add_xaxis(xaxis_data=x)
line.add_yaxis(series_name='cpu usage', y_axis=y, is_smooth=True)
line.add_yaxis(series_name='memory usage', y_axis=z, is_smooth=True)
# title_opts sets the chart title
line.set_global_opts(title_opts=opts.TitleOpts(title='CPU and memory usage'))
line.render()  # writes render.html; open it in a browser
# Data collected from Linux hosts this way can be stored in MySQL and then
# served through a web framework such as Django, Flask, or FastAPI.
Before running these scripts, install the required dependencies first: pip install xxx
10 killer automation Python scripts
"Automation is not the enemy of the human worker but an ally.
Automation frees the worker from drudgery and gives him the chance to do more creative and rewarding work."
1. File transfer script
A file transfer script in Python is a set of instructions or a program, written in the Python programming language, that automates the process of transferring files over a network or between computers.
Python provides several libraries and modules that can be used to build one, such as socket, ftplib, smtplib, and paramiko.
Here is a simple example of a file transfer script in Python that uses the socket module to transfer a file over the network:
import socket

# create socket
s = socket.socket()
# bind socket to an address and port
s.bind(('localhost', 12345))
# put the socket into listening mode
s.listen(5)
print('Server listening...')
# forever loop to keep server running
while True:
    # establish connection with client
    client, addr = s.accept()
    print(f'Got connection from {addr}')
    # receive the file name
    file_name = client.recv(1024).decode()
    try:
        # open the file for reading in binary
        with open(file_name, 'rb') as file:
            # read the file in chunks
            while True:
                chunk = file.read(1024)
                if not chunk:
                    break
                # send the chunk to the client
                client.sendall(chunk)
        print(f'File {file_name} sent successfully')
    except FileNotFoundError:
        # if file not found, send appropriate message
        client.sendall(b'File not found')
        print(f'File {file_name} not found')
    # close the client connection
    client.close()
This script runs a server that listens for incoming connections on the address localhost and port 12345.
When a client connects, the server receives a file name from it, then reads the contents of that file and sends it to the client in chunks.
If the file is not found, the server sends a corresponding message to the client instead.
As mentioned above, other libraries and modules can be used to build file transfer scripts in python, such as ftplib for connecting and transferring over the FTP protocol and paramiko for SFTP/SSH transfers.
The script can be customized to match specific requirements or scenarios.
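The server above needs a matching client. A minimal sketch (the helper name recv_file is mine): it sends the file name, then collects 1024-byte chunks until the server closes the connection.

```python
import socket

def recv_file(host, port, file_name, out_path):
    """Request file_name from the chunk server above; write the bytes to out_path."""
    with socket.socket() as s:
        s.connect((host, port))
        s.sendall(file_name.encode())
        chunks = []
        while True:
            chunk = s.recv(1024)
            if not chunk:  # empty read means the server closed the connection
                break
            chunks.append(chunk)
    data = b"".join(chunks)
    with open(out_path, "wb") as f:
        f.write(data)
    return data
```

Note this simple framing relies on the server closing the socket to mark end-of-file; a production protocol would send the file length first.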
2. System monitoring script
A system monitoring script is a Python script used to monitor the performance and status of a computer or network.
The script can track various metrics, such as CPU usage, memory usage, disk space, network traffic, and system uptime.
It can also watch for certain events or conditions, such as the occurrence of an error or the availability of a specific service.
For example:
import psutil

# Get the current CPU usage
cpu_usage = psutil.cpu_percent()
# Get the current memory usage
memory_usage = psutil.virtual_memory().percent
# Get the current disk usage
disk_usage = psutil.disk_usage("/").percent
# Get the network activity:
# current input/output counters for each network interface
io_counters = psutil.net_io_counters(pernic=True)
for interface, counters in io_counters.items():
    print(f"Interface {interface}:")
    print(f"  bytes sent: {counters.bytes_sent}")
    print(f"  bytes received: {counters.bytes_recv}")
# Get a list of active connections
connections = psutil.net_connections()
for connection in connections:
    print(f"{connection.laddr} <-> {connection.raddr} ({connection.status})")
# Print the collected data
print(f"CPU usage: {cpu_usage}%")
print(f"Memory usage: {memory_usage}%")
print(f"Disk usage: {disk_usage}%")
This script uses the following functions from the psutil module to read the current state:
cpu_percent: CPU usage
virtual_memory: memory usage
disk_usage: disk usage
The virtual_memory function returns an object with various attributes, such as the total amount of memory and the amounts used and available.
The disk_usage function takes a path as an argument and returns an object with attributes such as the total amount of space on the disk and the amounts used and free.
3. Web scraping script (the most used)
This script can be used to extract data from websites and store it in a structured format, such as a spreadsheet or a database.
It is very useful for collecting data for analysis or for tracking changes on a website.
For example:
import requests
from bs4 import BeautifulSoup

# Fetch a web page
page = requests.get("http://www.example.com")
# Parse the HTML content
soup = BeautifulSoup(page.content, "html.parser")
# Find all the links on the page
links = soup.find_all("a")
# Print the links
for link in links:
    print(link.get("href"))
You can see how powerful BeautifulSoup is.
You can find any kind of DOM object with this package; the example above shows how to find all the links on a page.
You can modify the script to scrape other kinds of data, or to navigate to different pages of a site.
You can also use the find method to locate a specific element, or pass extra arguments to find_all to filter the results.
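When installing bs4 is not an option, the stdlib html.parser module can handle the same link-extraction job. This is a dependency-free stand-in sketch, not BeautifulSoup's API:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags, mirroring soup.find_all('a')."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

The event-driven style is more work than BeautifulSoup's tree queries, but it ships with Python and is tolerant of messy real-world HTML.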
4. Email automation script
This script can be used to send emails automatically based on specific conditions.
For example, you could use it to send your team a daily report, or to send yourself a reminder when an important deadline is approaching.
Here is an example of how to send an email with Python:
import smtplib
from email.mime.text import MIMEText
# Set the SMTP server and login credentials
smtp_server = "smtp.gmail.com"
smtp_port = 587
username = "your@email.com"
password = "yourpassword"
# Set the email parameters
recipient = "recipient@email.com"
subject = "Test email from Python"
body = "This is a test email sent from Python."
# Create the email message
msg = MIMEText(body)
msg["Subject"] = subject
msg["To"] = recipient
msg["From"] = username
# Send the email
server = smtplib.SMTP(smtp_server, smtp_port)
server.starttls()
server.login(username, password)
server.send_message(msg)
server.quit()
This script uses the smtplib and email modules to send mail over the Simple Mail Transfer Protocol, SMTP.
The SMTP class from the smtplib module creates the SMTP client, the starttls and login methods establish a secure authenticated connection, and the MIMEText class from the email module builds the message in Multipurpose Internet Mail Extensions (MIME) format.
The MIMEText constructor takes the body of the email as an argument, and item assignment (msg["..."] = ...) sets the subject, recipient, and sender.
Once the message is built, the SMTP object's send_message method sends it.
The quit method is then called to close the connection to the SMTP server.
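The message-building step can be exercised without any SMTP server. The stdlib's newer email.message.EmailMessage API covers the same ground as MIMEText; a sketch (build_email is my helper name, and the addresses are placeholders):

```python
from email.message import EmailMessage

def build_email(sender, recipient, subject, body):
    """Assemble a plain-text message; sending it is a separate smtplib step."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)  # sets the text/plain body and its headers
    return msg
```

The same object can then be handed to smtplib's send_message, exactly as in the script above.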
5. Password manager script
A password manager script is a Python script used to store and manage passwords securely.
The script typically includes functions for generating random passwords, storing hashed passwords in a secure location such as a database or a file, and checking passwords when needed.
import secrets
import string

# Generate a random password
def generate_password(length=16):
    characters = string.ascii_letters + string.digits + string.punctuation
    password = "".join(secrets.choice(characters) for i in range(length))
    return password

# Store a password in a secure way
def store_password(service, username, password):
    # Use a secure hashing function to store the password
    # (hash_function is assumed to be defined elsewhere, e.g. a salted PBKDF2)
    hashed_password = hash_function(password)
    # Store the hashed password in a database or file
    with open("password_database.txt", "a") as f:
        f.write(f"{service},{username},{hashed_password}\n")

# Check a password
def check_password(service, username, password):
    # Look up the hashed password in the database or file
    with open("password_database.txt") as f:
        for line in f:
            service_, username_, hashed_password_ = line.strip().split(",")
            if service == service_ and username == username_:
                # Compare the stored hash with the hash of the provided password
                return hash_function(password) == hashed_password_
    return False

In the example script above, the generate_password function builds a random password of the specified length from a combination of letters, digits, and punctuation characters.
The store_password function takes a service (such as a website or application), a username, and a password as input, and stores the hashed password in a secure location.
The check_password function takes the same service, username, and password and reports whether the stored hash matches; note that a plain hash can only be verified, not reversed back into the original password.
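The hash_function placeholder above can be filled in with hashlib.pbkdf2_hmac from the stdlib. A sketch of a salted store-and-verify pair (the function names are mine):

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None, iterations=100000):
    """Return (salt, digest); a fresh random salt per password defeats rainbow tables."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest

def verify_password(password, salt, digest, iterations=100000):
    """Recompute with the stored salt; compare_digest avoids timing leaks."""
    _, candidate = hash_password(password, salt, iterations)
    return hmac.compare_digest(candidate, digest)
```

The salt and digest would both be written to the password file, one pair per record, so the check can re-derive the digest later.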
Automated Python scripts, part 2
Welcome back!
In the previous installment we dove into the world of Python scripts, but we have not yet uncovered all of their secrets.
In this installment we discover the remaining five kinds of scripts, which will have you coding like a pro in no time.
6. Automating data analysis:
Python's pandas is a powerful tool for data analysis and manipulation.
The following script demonstrates how to use it to automate the process of cleaning, transforming, and analyzing a dataset.
import pandas as pd
# Reading a CSV file
df = pd.read_csv("data.csv")
# Cleaning data
df.dropna(inplace=True) # Dropping missing values
df = df[df["column_name"] != "some_value"] # Removing specific rows
# Transforming data
df["column_name"] = df["column_name"].str.lower() # Changing string to lowercase
df["column_name"] = df["column_name"].astype(int) # Changing column datatype
# Analyzing data
print(df["column_name"].value_counts()) # Prints the frequency of unique values in the column
# Saving the cleaned and transformed data to a new CSV file
df.to_csv("cleaned_data.csv", index=False)
The comments in the script above should be straightforward for anyone with basic Python knowledge.
The script is a simple example that demonstrates the power of the pandas library and how it can automate data cleaning, transformation, and analysis tasks.
It is deliberately limited, though; in real scenarios the dataset can be much larger, and the cleaning, transformation, and analysis operations can be far more involved.
7. Automating computer vision tasks:
Automating computer vision tasks means using Python and its libraries to perform various image processing and computer vision operations automatically.
One of the most popular Python libraries for computer vision tasks is opencv.
OpenCV is a library of programming functions aimed mainly at real-time computer vision.
It provides a wide range of functionality, including image and video I/O, image processing, video analysis, object detection and recognition, and much more.
For example:
import cv2

# Load the cascade classifier for face detection
face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
# Load the image
img = cv2.imread("image.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Detect faces
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
# Draw rectangles around the faces
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)
# Show the image
cv2.imshow("Faces", img)
cv2.waitKey(0)
cv2.destroyAllWindows()
The script above detects faces in an image.
It first loads a cascade classifier for face detection, a pre-trained model that can recognize faces in an image.
It then loads the image and converts it to grayscale using the cv2.cvtColor() method.
The grayscale image is passed to the classifier's detectMultiScale() method, which detects the faces in the image and returns a list of their coordinates.
The script then loops over the list of coordinates and draws a rectangle around each detected face using the cv2.rectangle() method.
Finally, the image is displayed on screen using the cv2.imshow() method.
This is only a basic example of what OpenCV can achieve; there is much more that can be automated, such as object detection, image processing, and video analysis.
OpenCV is a very powerful library that can automate a wide range of computer vision tasks, such as face recognition, object tracking, and image stabilization.
8. Automating data encryption:
Automating data encryption means using Python and its libraries to encrypt and decrypt data and files automatically.
One of the most popular Python libraries for data encryption is cryptography.
cryptography is a library that provides cryptographic recipes and primitives.
It includes both high-level recipes and low-level interfaces to common encryption algorithms such as symmetric ciphers, message digests, and key derivation functions.
The following example demonstrates how to encrypt a file using the cryptography library:
import base64
import os
from cryptography.fernet import Fernet
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

password = b"super_secret_password"
salt = os.urandom(16)
kdf = PBKDF2HMAC(
    algorithm=hashes.SHA256(),
    iterations=100000,
    length=32,
    salt=salt,
    backend=default_backend()
)
key = base64.urlsafe_b64encode(kdf.derive(password))
cipher = Fernet(key)
# Encrypt the file
with open("file.txt", "rb") as f:
    data = f.read()
cipher_text = cipher.encrypt(data)
with open("file.txt", "wb") as f:
    f.write(cipher_text)
It first generates a key using the PBKDF2HMAC key derivation function, a password-based KDF that uses the secure hash algorithm SHA-256 and a salt value.
The salt value is generated with the os.urandom() function, which produces cryptographically secure random bytes.
It then creates a Fernet object, an implementation of symmetric (also known as "secret key") authenticated encryption.
Next, it reads the plaintext file and encrypts it with the Fernet object's encrypt() method.
Finally, it writes the encrypted data back to the file.
Be sure to note that the key used to encrypt the file must be kept secret and stored securely, and the salt must be stored as well, or the key cannot be re-derived.
If the key is lost or compromised, the encrypted data cannot be read.
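The key derivation step has a stdlib twin, hashlib.pbkdf2_hmac, which makes the salt-must-be-stored point easy to demonstrate: the same password and salt always reproduce the same key. A sketch (derive_key is my name for it):

```python
import base64
import hashlib
import os

def derive_key(password, salt, iterations=100000):
    """PBKDF2-SHA256, mirroring the PBKDF2HMAC step above, via the stdlib."""
    raw = hashlib.pbkdf2_hmac("sha256", password, salt, iterations, dklen=32)
    # Fernet expects a urlsafe-base64-encoded 32-byte key
    return base64.urlsafe_b64encode(raw)

# The salt must be persisted (for example, prepended to the ciphertext);
# without it, the key cannot be re-derived for decryption.
salt = os.urandom(16)
key = derive_key(b"super_secret_password", salt)
```

Storing the salt next to the ciphertext is safe: the salt is not secret, it only forces attackers to brute-force each password individually.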
9. Automating testing and debugging:
Automated testing and debugging means using Python and its libraries to run tests and debug code automatically.
Several Python libraries are popular for this, such as unittest, pytest, nose, and doctest.
Here is an example that uses the unittest library to automatically test a Python function that finds the longest palindromic substring of a given string:
import unittest

def longest_palindrome(s):
    n = len(s)
    ans = ""
    for i in range(n):
        for j in range(i + 1, n + 1):
            substring = s[i:j]
            if substring == substring[::-1] and len(substring) > len(ans):
                ans = substring
    return ans

class TestLongestPalindrome(unittest.TestCase):
    def test_longest_palindrome(self):
        self.assertEqual(longest_palindrome("babad"), "bab")
        self.assertEqual(longest_palindrome("cbbd"), "bb")
        self.assertEqual(longest_palindrome("a"), "a")
        self.assertEqual(longest_palindrome(""), "")

if __name__ == '__main__':
    unittest.main()
This script uses the unittest library to automatically test a Python function that finds the longest palindromic substring in a given string.
The 'longest_palindrome' function takes a string as input and returns the longest palindromic substring by iterating over all possible substrings and keeping any that read the same in both directions and are longer than the current best.
The script also defines a 'TestLongestPalindrome' class that inherits from unittest.TestCase and contains the test methods.
Each test method uses the assertEqual() method to check that the output of longest_palindrome() equals the expected output.
When the script runs, the unittest.main() function is called, which executes all the test methods in the TestLongestPalindrome class.
If any test fails, that is, the output of longest_palindrome() differs from the expected output, an error message is printed indicating which test failed and what the expected and actual outputs were.
This script is an example of how to automatically test a Python function with the unittest library; it lets you test code easily and catch any bugs before deploying it to production.
10. Automating time-series forecasting:
Automated time-series forecasting means using Python and its libraries to predict future values of time-series data automatically.
In Python, several libraries are popular for this, such as statsmodels and prophet.
prophet is an open-source library developed by Facebook that provides a simple and fast way to perform time-series forecasting.
It is based on an additive model in which a non-linear trend is fit with yearly, weekly, and daily seasonality, plus holiday effects.
It works best with time series that have strong seasonal effects and several seasons of historical data.
Here is an example of using the prophet library to perform a time-series forecast on daily sales data:
import pandas as pd
from fbprophet import Prophet
# Read in data
df = pd.read_csv("sales_data.csv")
# Create prophet model
model = Prophet()
# Fit model to data
model.fit(df)
# Create future dataframe
future_data = model.make_future_dataframe(periods=365)
# Make predictions
forecast = model.predict(future_data)
# Print forecast dataframe
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']])
As the saying goes, a picture is worth a thousand words.
You can also visualize the predicted sales by adding the following lines of code:
# Import visualization library
import matplotlib.pyplot as plt
# Plot predicted values
model.plot(forecast)
plt.show()
# Plot predicted values with uncertainty intervals
model.plot(forecast)
plt.fill_between(forecast['ds'], forecast['yhat_lower'], forecast['yhat_upper'], color='pink')
plt.show()
# Plot component of the forecast
model.plot_components(forecast)
plt.show()
The first visualization, model.plot(forecast), shows the predicted values together with the historical data; it gives you a rough sense of how well the model fits the data.
The second, plt.fill_between(forecast['ds'], forecast['yhat_lower'], forecast['yhat_upper'], color='pink'), shows the predicted values with their uncertainty intervals, so you can see how much uncertainty there is in the forecast.
The third, model.plot_components(forecast), shows the components of the forecast, such as the trend, seasonality, and holidays.
Open the file in append mode using the open() built-in function.
Then call print with the file argument.
f = open('my_log.txt', 'a')
print("Hello World", file=f)
The syntax to print a list to the file is
print(['apple', 'banana'], file=f)
Similarly we can print other data types to a file as well.
text = "Hello World! Welcome to new world."
print(text, file=f)
# Print tuple to file
print(('apple', 25), file=f)
# Print set to file
print({'a', 'e', 'i', 'o', 'u'}, file=f)
# Print dictionary to file
print({'mac' : 25, 'sony' : 22}, file=f)
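The examples above never close the file handle; wrapping the open in a with block takes care of that, and the same file= redirection accepts any object print can format. A small sketch (log_lines is my name for the helper):

```python
def log_lines(path, *objects):
    """Append each object as its own line using print's file= redirection."""
    with open(path, "a", encoding="utf-8") as f:
        for obj in objects:
            # print handles the str() conversion and the trailing newline
            print(obj, file=f)
```

Because print calls str() on each argument, lists, tuples, sets, and dicts all land in the file as their usual reprs.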
Get a URL and save the file
import requests
url = "https://www.geeksforgeeks.org/sql-using-python/"
#just a random link of a dummy file
r = requests.get(url)
#retrieving data from the URL using get method
with open("test.html", 'wb') as f:
    # open the file in binary write mode, giving it any name and format you need
    f.write(r.content)
    # write the URL contents fetched from the server
print("test.html file created")
How to set up a local HTTP server
Run the following commands to start a local HTTP server:
# If python -V returned 2.X.X
python -m SimpleHTTPServer
# If python -V returned 3.X.X
python3 -m http.server
# Note that on Windows you may need to run python -m http.server instead of python3 -m http.server
You'll notice that both commands look very different – one calls SimpleHTTPServer and the other http.server.
This is just because the SimpleHTTPServer module was rolled into Python's http.server in Python 3.
They both work the same way.
Now when you go to http://localhost:8000/ you should see a list of all the files in your directory.
Then you can just click on the HTML file you want to view.
Just keep in mind that SimpleHTTPServer and http.server are only for testing things locally.
They only do very basic security checks and shouldn't be used in production.
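The same server can also be started from Python itself, which makes the port and served directory explicit and is handy in scripts and tests. A sketch (serve_directory is my helper name):

```python
import threading
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

def serve_directory(directory, port=0):
    """Serve `directory` on localhost; port=0 asks the OS for a free port.

    Returns (server, thread); call server.shutdown() when finished.
    """
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread
```

server.server_address[1] reports the port that was actually chosen. The same testing-only caveat applies: this handler does only basic security checks.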
How to send files locally
To set up a sort of quick and dirty NAS (Network Attached Storage) system:
Make sure both computers are connected through same network via LAN or WiFi
Open your command prompt or terminal and run python -V to make sure Python is installed
Go to the directory with the file you want to share using the cd (change directory) command.
Start your HTTP server with either python -m SimpleHTTPServer or python3 -m http.server
Open new terminal and type ifconfig on *nix or MacOS or ipconfig on Windows to find your IP address
Now on the second computer or device:
Open browser and type in the IP address of the first machine, along with port 8000: http://[ip address]:8000
A page will open showing all the files in the directory being shared from the first computer.
If the page is taking too long to load, you may need to adjust the firewall settings on the first computer.
Python provides various ways of dealing with command-line arguments. The three most common are sys.argv, the argparse module, and the getopt module.
import sys

# total arguments
n = len(sys.argv)
print("Total arguments passed:", n)
# Arguments passed
print("\nName of Python script:", sys.argv[0])
print("\nArguments passed:", end=" ")
for i in range(1, n):
    print(sys.argv[i], end=" ")
import requests
from bs4 import BeautifulSoup
URL = "http://www.guancha.cn/"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
filename = 'temp.html'
f = open(filename, "a", encoding = "utf-8")
f.write(str(soup.prettify())) # write() argument must be str
f.close()
import sys
# total arguments
n = len(sys.argv)
print("\n\n\nTotal arguments passed:", n)
# Arguments passed
print("Name of Python script:", sys.argv[0])
print("\nArguments passed:", end = "\n")
for i in range(0, n):
    print("Arguments ", i, " ", sys.argv[i])

# Addition of numbers
Sum = 0
# Sum the numeric arguments (sys.argv values are strings, so convert with int)
for i in range(1, n):
    Sum += int(sys.argv[i])
print("\n\nResult:", Sum)
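Beyond raw sys.argv, the argparse module gives you type conversion, validation, and a --help screen for free. Here is the same summing script sketched with it (the function names are mine):

```python
import argparse

def build_parser():
    """Parser for the summing example: any count of integers, optional --verbose."""
    parser = argparse.ArgumentParser(description="Sum integers from the command line")
    parser.add_argument("numbers", type=int, nargs="*", help="integers to add")
    parser.add_argument("--verbose", action="store_true", help="print the operands too")
    return parser

def run(argv=None):
    """Parse argv (or sys.argv when None) and return the sum."""
    args = build_parser().parse_args(argv)
    if args.verbose:
        print("operands:", args.numbers)
    return sum(args.numbers)
```

Note type=int does the int() conversion per argument, and a non-numeric argument produces a clear usage error instead of a traceback.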
from openpyxl import Workbook
wb = Workbook()
# grab the active worksheet
ws = wb.active
# Data can be assigned directly to cells
ws['A1'] = 42
# Rows can also be appended
ws.append([1, 2, 3])
# Python types will automatically be converted
import datetime
ws['A2'] = datetime.datetime.now()
# Save the file
wb.save("sample.xlsx")
Read XLSM File
# Import the Pandas library as pd
import pandas as pd
# Read xlsm file
df = pd.read_excel("score.xlsm",sheet_name='Sheet1',index_col=0)
# Display the Data
print(df)
Read data from the Excel file
import pandas as pd
excel_file = 'movies.xls'
movies = pd.read_excel(excel_file)
movies.head()
movies_sheet1 = pd.read_excel(excel_file, sheet_name=0, index_col=0)
movies_sheet1.head()
movies_sheet2 = pd.read_excel(excel_file, sheet_name=1, index_col=0)
movies_sheet2.head()
movies_sheet3 = pd.read_excel(excel_file, sheet_name=2, index_col=0)
movies_sheet3.head()
movies = pd.concat([movies_sheet1, movies_sheet2, movies_sheet3])
movies.shape
xlsx = pd.ExcelFile(excel_file)
movies_sheets = []
for sheet in xlsx.sheet_names:
movies_sheets.append(xlsx.parse(sheet))
movies = pd.concat(movies_sheets)
Automate Excel in Python
import openpyxl as xl
from openpyxl.chart import BarChart, Reference
wb = xl.load_workbook('python-spreadsheet.xlsx')
sheet = wb['Sheet1']
for row in range(2, sheet.max_row + 1):
    cell = sheet.cell(row, 3)
    corrected_price = float(cell.value.replace('$', '')) * 0.9
    corrected_price_cell = sheet.cell(row, 4)
    corrected_price_cell.value = corrected_price
values = Reference(sheet, min_row=2, max_row=sheet.max_row, min_col=4, max_col=4)
chart = BarChart()
chart.add_data(values)
sheet.add_chart(chart, 'e2')
wb.save('python-spreadsheet2.xlsx')  # openpyxl writes the .xlsx format
# To make it work for several spreadsheets, move the code inside a function
def process_workbook(filename):
    wb = xl.load_workbook(filename)
    sheet = wb['Sheet1']
    for row in range(2, sheet.max_row + 1):
        cell = sheet.cell(row, 3)
        corrected_price = float(cell.value.replace('$', '')) * 0.9
        corrected_price_cell = sheet.cell(row, 4)
        corrected_price_cell.value = corrected_price
    values = Reference(sheet, min_row=2, max_row=sheet.max_row, min_col=4, max_col=4)
    chart = BarChart()
    chart.add_data(values)
    sheet.add_chart(chart, 'e2')
    wb.save(filename)
https://openpyxl.readthedocs.io/en/stable/
OpenPyXL is not your only choice.
There are several other packages that support Microsoft Excel:
xlrd – For reading older Excel (.xls) documents
xlwt – For writing older Excel (.xls) documents
xlwings – Works with new Excel formats and has macro capabilities
A couple years ago, the first two used to be the most popular libraries to use with Excel documents.
However, the author of those packages has stopped supporting them.
The xlwings package has lots of promise, but does not work on all platforms and requires that Microsoft Excel is installed.
You will be using OpenPyXL in this article because it is actively developed and supported.
OpenPyXL doesn’t require Microsoft Excel to be installed, and it works on all platforms.
You can install OpenPyXL using pip:
$ python -m pip install openpyxl
After the installation has completed, let’s find out how to use OpenPyXL to read an Excel spreadsheet!
Getting Sheets from a Workbook
The first step is to find an Excel file to use with OpenPyXL.
There is a books.xlsx file that is provided for you in this book's GitHub repository.
You can download it by going to this URL:
https://github.com/driscollis/python101code/tree/master/chapter38_excel
Feel free to use your own file, although the output from your own file won’t match the sample output in this book.
The next step is to write some code to open the spreadsheet.
To do that, create a new file named open_workbook.py and add this code to it:
# open_workbook.py
from openpyxl import load_workbook

def open_workbook(path):
    workbook = load_workbook(filename=path)
    print(f'Worksheet names: {workbook.sheetnames}')
    sheet = workbook.active
    print(sheet)
    print(f'The title of the Worksheet is: {sheet.title}')

if __name__ == '__main__':
    open_workbook('books.xlsx')
In this example, you import load_workbook() from openpyxl and then create open_workbook() which takes in the path to your Excel spreadsheet.
Next, you use load_workbook() to create an openpyxl.workbook.workbook.Workbook object.
This object allows you to access the sheets and cells in your spreadsheet.
And yes, it really does have the double workbook in its name.
That’s not a typo!
The rest of the open_workbook() function demonstrates how to print out all the currently defined sheets in your spreadsheet, get the currently active sheet and print out the title of that sheet.
When you run this code, you will see the following output:
Worksheet names: ['Sheet 1 - Books']
<Worksheet "Sheet 1 - Books">
The title of the Worksheet is: Sheet 1 - Books
Now that you know how to access the sheets in the spreadsheet, you are ready to move on to accessing cell data!
Reading Cell Data
When you are working with Microsoft Excel, the data is stored in cells.
You need a way to access those cells from Python to be able to extract that data.
OpenPyXL makes this process straightforward.
Create a new file named workbook_cells.py and add this code to it:
# workbook_cells.py
from openpyxl import load_workbook

def get_cell_info(path):
    workbook = load_workbook(filename=path)
    sheet = workbook.active
    print(sheet)
    print(f'The title of the Worksheet is: {sheet.title}')
    print(f'The value of {sheet["A2"].value=}')
    print(f'The value of {sheet["A3"].value=}')
    cell = sheet['B3']
    print(f'{cell.value=}')

if __name__ == '__main__':
    get_cell_info('books.xlsx')
This code will load up the Excel file in an OpenPyXL workbook.
You will grab the active sheet and then print out its title and a couple of different cell values.
You can access a cell by using the sheet object followed by square brackets with the column name and row number inside of it.
For example, sheet["A2"] will get you the cell at column “A”, row 2.
To get the value of that cell, you use the value attribute.
Note: This code is using a new feature that was added to f-strings in Python 3.8.
If you run this with an earlier version, you will receive an error.
When you run this code, you will get this output:
<Worksheet "Sheet 1 - Books">
The title of the Worksheet is: Sheet 1 - Books
The value of sheet["A2"].value='Title'
The value of sheet["A3"].value='Python 101'
cell.value='Mike Driscoll'
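The `sheet["A2"].value=` fragments in that output come from the f-string `=` specifier mentioned in the note above. A minimal sketch of the feature, independent of openpyxl, with the hand-written equivalent you would need on Python 3.7 and earlier:

```python
# Python 3.8+ f-string "=" specifier: echoes the expression and its value.
value = 'Mike Driscoll'
print(f'{value=}')  # value='Mike Driscoll'

# On Python 3.7 and earlier you would spell it out by hand:
print('value=' + repr(value))  # value='Mike Driscoll'
```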
You can get additional information about a cell using some of its other attributes.
Add the following function to your file and update the conditional statement at the end to run it:
def get_info_by_coord(path):
    workbook = load_workbook(filename=path)
    sheet = workbook.active
    cell = sheet['A2']
    print(f'Row {cell.row}, Col {cell.column} = {cell.value}')
    print(f'{cell.value=} is at {cell.coordinate=}')

if __name__ == '__main__':
    get_info_by_coord('books.xlsx')
In this example, you use the row and column attributes of the cell object to get the row and column information.
Note that column “A” maps to “1”, “B” to “2”, etcetera.
If you were to iterate over the Excel document, you could use the coordinate attribute to get the cell name.
When you run this code, the output will look like this:
Row 2, Col 1 = Title
cell.value='Title' is at cell.coordinate='A2'
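A cell's coordinate is just its column letter glued to its row number ("A2" is column 1, row 2). This hypothetical helper, not part of openpyxl, rebuilds a coordinate from 1-based row and column indices to make the mapping explicit:

```python
def coordinate(row, col):
    """Build an Excel-style coordinate (e.g. 'A2') from 1-based indices."""
    letters = ''
    while col > 0:
        # Excel columns are effectively bijective base-26: A..Z, AA..AZ, ...
        col, rem = divmod(col - 1, 26)
        letters = chr(ord('A') + rem) + letters
    return f'{letters}{row}'

print(coordinate(2, 1))   # A2
print(coordinate(1, 27))  # AA1
```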
Speaking of iterating, let’s find out how to do that next!
Iterating Over Rows and Columns
Sometimes you will need to iterate over the entire Excel spreadsheet or portions of the spreadsheet.
OpenPyXL allows you to do that in a few different ways.
Create a new file named iterating_over_cells.py and add the following code to it:
# iterating_over_cells.py
from openpyxl import load_workbook

def iterating_range(path):
    workbook = load_workbook(filename=path)
    sheet = workbook.active
    for cell in sheet['A']:
        print(cell)

if __name__ == '__main__':
    iterating_range('books.xlsx')
Here you load up the spreadsheet and then loop over all the cells in column “A”.
For each cell, you print out the cell object.
You could use some of the cell attributes you learned about in the previous section if you wanted to format the output more granularly.
This is what you get from running this code:
<Cell 'Sheet 1 - Books'.A1>
<Cell 'Sheet 1 - Books'.A2>
<Cell 'Sheet 1 - Books'.A3>
<Cell 'Sheet 1 - Books'.A4>
<Cell 'Sheet 1 - Books'.A5>
<Cell 'Sheet 1 - Books'.A6>
<Cell 'Sheet 1 - Books'.A7>
<Cell 'Sheet 1 - Books'.A8>
<Cell 'Sheet 1 - Books'.A9>
<Cell 'Sheet 1 - Books'.A10>
# output truncated for brevity
The output is truncated as it will print out quite a few cells by default.
OpenPyXL provides other ways to iterate over rows and columns by using the iter_rows() and iter_cols() functions.
These methods accept several arguments:
min_row
max_row
min_col
max_col
You can also add on a values_only argument that tells OpenPyXL to return the value of the cell instead of the cell object.
Go ahead and create a new file named iterating_over_cell_values.py and add this code to it:
# iterating_over_cell_values.py
from openpyxl import load_workbook

def iterating_over_values(path):
    workbook = load_workbook(filename=path)
    sheet = workbook.active
    for value in sheet.iter_rows(
        min_row=1, max_row=3,
        min_col=1, max_col=3,
        values_only=True,
    ):
        print(value)

if __name__ == '__main__':
    iterating_over_values('books.xlsx')
This code demonstrates how you can use the iter_rows() to iterate over the rows in the Excel spreadsheet and print out the values of those rows.
When you run this code, you will get the following output:
('Books', None, None)
('Title', 'Author', 'Publisher')
('Python 101', 'Mike Driscoll', 'Mouse vs Python')
The output is a series of Python tuples, one per row, each containing the values of that row's cells.
At this point you have learned how to open spreadsheets and read data — both from specific cells, as well as through iteration.
You are now ready to learn how to use OpenPyXL to create Excel spreadsheets!
Writing Excel Spreadsheets
Creating an Excel spreadsheet using OpenPyXL doesn’t take a lot of code.
You can create a spreadsheet by using the Workbook() class.
Go ahead and create a new file named writing_hello.py and add this code to it:
# writing_hello.py
from openpyxl import Workbook

def create_workbook(path):
    workbook = Workbook()
    sheet = workbook.active
    sheet['A1'] = 'Hello'
    sheet['A2'] = 'from'
    sheet['A3'] = 'OpenPyXL'
    workbook.save(path)

if __name__ == '__main__':
    create_workbook('hello.xlsx')
Here you instantiate Workbook() and get the active sheet.
Then you set the first three rows in column “A” to different strings.
Finally, you call save() and pass it the path to save the new document to.
Congratulations! You have just created an Excel spreadsheet with Python.
Let’s discover how to add and remove sheets in your Workbook next!
Adding and Removing Sheets
Many people like to organize their data across multiple Worksheets within the Workbook.
OpenPyXL supports the ability to add new sheets to a Workbook() object via its create_sheet() method.
Create a new file named creating_sheets.py and add this code to it:
# creating_sheets.py
import openpyxl

def create_worksheets(path):
    workbook = openpyxl.Workbook()
    print(workbook.sheetnames)
    # Add a new worksheet
    workbook.create_sheet()
    print(workbook.sheetnames)
    # Insert a worksheet
    workbook.create_sheet(index=1,
                          title='Second sheet')
    print(workbook.sheetnames)
    workbook.save(path)

if __name__ == '__main__':
    create_worksheets('sheets.xlsx')
Here you use create_sheet() twice to add two new Worksheets to the Workbook.
The second example shows you how to set the title of a sheet and at which index to insert the sheet.
The argument index=1 means that the worksheet will be added after the first existing worksheet, since they are indexed starting at 0.
When you run this code, you will see the following output:
['Sheet']
['Sheet', 'Sheet1']
['Sheet', 'Second sheet', 'Sheet1']
You can see that the new sheets have been added step-by-step to your Workbook.
After saving the file, you can verify that there are multiple Worksheets by opening Excel or another Excel-compatible application.
After this automated worksheet-creation process, you’ve suddenly got too many sheets, so let’s get rid of some.
There are two ways to remove a sheet.
Go ahead and create delete_sheets.py to see how to use Python’s del keyword for removing worksheets:
# delete_sheets.py
import openpyxl

def create_worksheets(path):
    workbook = openpyxl.Workbook()
    workbook.create_sheet()
    # Insert a worksheet
    workbook.create_sheet(index=1,
                          title='Second sheet')
    print(workbook.sheetnames)
    del workbook['Second sheet']
    print(workbook.sheetnames)
    workbook.save(path)

if __name__ == '__main__':
    create_worksheets('del_sheets.xlsx')
This code will create a new Workbook and then add two new Worksheets to it.
Then it uses Python’s del keyword to delete workbook['Second sheet'].
You can verify that it worked as expected by looking at the print-out of the sheet list before and after the del command:
['Sheet', 'Second sheet', 'Sheet1']
['Sheet', 'Sheet1']
The other way to delete a sheet from a Workbook is to use the remove() method.
Create a new file called remove_sheets.py and enter this code to learn how that works:
# remove_sheets.py
import openpyxl

def remove_worksheets(path):
    workbook = openpyxl.Workbook()
    sheet1 = workbook.create_sheet()
    # Insert a worksheet
    workbook.create_sheet(index=1,
                          title='Second sheet')
    print(workbook.sheetnames)
    workbook.remove(sheet1)
    print(workbook.sheetnames)
    workbook.save(path)

if __name__ == '__main__':
    remove_worksheets('remove_sheets.xlsx')
This time around, you hold onto a reference to the first Worksheet that you create by assigning the result to sheet1.
Then you remove it later on in the code.
Alternatively, you could also remove that sheet by using the same syntax as before, like this:
workbook.remove(workbook['Sheet1'])
No matter which method you choose for removing the Worksheet, the output will be the same:
['Sheet', 'Second sheet', 'Sheet1']
['Sheet', 'Second sheet']
Now let’s move on and learn how you can add and remove rows and columns.
Adding and Deleting Rows and Columns
OpenPyXL has several useful methods that you can use for adding and removing rows and columns in your spreadsheet.
Here is a list of the four methods you will learn about in this section:
.insert_rows()
.delete_rows()
.insert_cols()
.delete_cols()
Each of these methods can take two arguments:
idx – The index at which to insert (or start deleting) the row or column
amount – The number of rows or columns to add or delete
To see how this works, create a file named insert_demo.py and add the following code to it:
# insert_demo.py
from openpyxl import Workbook

def inserting_cols_rows(path):
    workbook = Workbook()
    sheet = workbook.active
    sheet['A1'] = 'Hello'
    sheet['A2'] = 'from'
    sheet['A3'] = 'OpenPyXL'
    # insert a column before A
    sheet.insert_cols(idx=1)
    # insert 2 rows starting on the second row
    sheet.insert_rows(idx=2, amount=2)
    workbook.save(path)

if __name__ == '__main__':
    inserting_cols_rows('inserting.xlsx')
Here you create a Worksheet and insert a new column before column “A”.
Columns are indexed starting at 1, while worksheets, in contrast, are indexed starting at 0.
This effectively moves all the cells in column A to column B.
Then you insert two new rows starting on row 2.
Now that you know how to insert columns and rows, it is time for you to discover how to remove them.
To find out how to remove columns or rows, create a new file named delete_demo.py and add this code:
# delete_demo.py
from openpyxl import Workbook

def deleting_cols_rows(path):
    workbook = Workbook()
    sheet = workbook.active
    sheet['A1'] = 'Hello'
    sheet['B1'] = 'from'
    sheet['C1'] = 'OpenPyXL'
    sheet['A2'] = 'row 2'
    sheet['A3'] = 'row 3'
    sheet['A4'] = 'row 4'
    # Delete column A
    sheet.delete_cols(idx=1)
    # delete 2 rows starting on the second row
    sheet.delete_rows(idx=2, amount=2)
    workbook.save(path)

if __name__ == '__main__':
    deleting_cols_rows('deleting.xlsx')
This code creates text in several cells and then removes column A using delete_cols().
It also removes two rows starting on the 2nd row via delete_rows().
Being able to add and remove columns and rows can be quite useful when it comes to organizing your data.
A Guide to Excel Spreadsheets in Python With openpyxl
Python Packages for Excel
1. openpyxl
2. xlrd
3. xlsxwriter
4. xlwt
5. xlutils
$ pip install openpyxl
from openpyxl import Workbook
workbook = Workbook()
sheet = workbook.active
sheet["A1"] = "hello"
sheet["B1"] = "world!"
workbook.save(filename="hello_world.xlsx")
And that's it: your first spreadsheet created! The rest of this tutorial works with a sample dataset, sample.xlsx, which you should download before following along.
A Simple Approach to Reading an Excel Spreadsheet
>>> from openpyxl import load_workbook
>>> workbook = load_workbook(filename="sample.xlsx")
>>> workbook.sheetnames
['Sheet 1']
>>> sheet = workbook.active
>>> sheet
<Worksheet "Sheet 1">
>>> sheet.title
'Sheet 1'
In the code above, you first open the spreadsheet sample.xlsx using load_workbook(), and then you can use workbook.sheetnames to see all the sheets you have available to work with.
After that, workbook.active selects the first available sheet and, in this case, you can see that it selects Sheet 1 automatically.
Using these methods is the default way of opening a spreadsheet, and you'll see it many times during this tutorial.
Now, after opening a spreadsheet, you can easily retrieve data from it like this:
>>> sheet["A1"]
<Cell 'Sheet 1'.A1>
>>> sheet["A1"].value
'marketplace'
>>> sheet["F10"].value
"G-Shock Men's Grey Sport Watch"
To return the actual value of a cell, you need to do .value.
Otherwise, you'll get the main Cell object.
You can also use the method .cell() to retrieve a cell using index notation.
Remember to add .value to get the actual value and not a Cell object:
>>> sheet.cell(row=10, column=6)
<Cell 'Sheet 1'.F10>
>>> sheet.cell(row=10, column=6).value
"G-Shock Men's Grey Sport Watch"
You can see that the results returned are the same, no matter which way you decide to go with.
However, in this tutorial, you'll be mostly using the first approach: ["A1"].
Note: Even though in Python you're used to a zero-indexed notation, with spreadsheets you'll always use a one-indexed notation where the first row or column always has index 1.
The above shows you the quickest way to open a spreadsheet.
However, you can pass additional parameters to change the way a spreadsheet is loaded.
Additional Reading Options
There are a few arguments you can pass to load_workbook() that change the way a spreadsheet is loaded.
The most important ones are the following two Booleans:
read_only loads a spreadsheet in read-only mode allowing you to open very large Excel files.
data_only ignores loading formulas and instead loads only the resulting values.
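As a sketch of those two flags, assuming openpyxl is installed; demo_options.xlsx is just a throwaway file created here so the example is self-contained:

```python
from openpyxl import Workbook, load_workbook

# Build a tiny workbook first so the sketch has something to load.
wb = Workbook()
wb.active["A1"] = "marketplace"
wb.save("demo_options.xlsx")

# read_only streams cells instead of loading the whole file into memory;
# data_only returns cached formula results instead of formula strings.
workbook = load_workbook(filename="demo_options.xlsx",
                         read_only=True,
                         data_only=True)
sheet = workbook.active
first = sheet["A1"].value
print(first)  # marketplace
workbook.close()  # read-only workbooks keep the file handle open
```

Note that data_only relies on values cached by Excel: a formula written by openpyxl but never opened in Excel has no cached result, so its value comes back as None.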
Importing Data From a Spreadsheet
Now that you've learned the basics about loading a spreadsheet, it's about time you get to the fun part: the iteration and actual usage of the values within the spreadsheet.
This section is where you'll learn all the different ways you can iterate through the data, but also how to convert that data into something usable and, more importantly, how to do it in a Pythonic way.
Iterating Through the Data
There are a few different ways you can iterate through the data depending on your needs.
You can slice the data with a combination of columns and rows:
>>> sheet["A1:C2"]
((<Cell 'Sheet 1'.A1>, <Cell 'Sheet 1'.B1>, <Cell 'Sheet 1'.C1>),
(<Cell 'Sheet 1'.A2>, <Cell 'Sheet 1'.B2>, <Cell 'Sheet 1'.C2>))
You can get ranges of rows or columns:
>>> # Get all cells from column A
>>> sheet["A"]
(<Cell 'Sheet 1'.A1>,
<Cell 'Sheet 1'.A2>,
...
<Cell 'Sheet 1'.A99>,
<Cell 'Sheet 1'.A100>)
>>> # Get all cells for a range of columns
>>> sheet["A:B"]
((<Cell 'Sheet 1'.A1>,
<Cell 'Sheet 1'.A2>,
...
<Cell 'Sheet 1'.A99>,
<Cell 'Sheet 1'.A100>),
(<Cell 'Sheet 1'.B1>,
<Cell 'Sheet 1'.B2>,
...
<Cell 'Sheet 1'.B99>,
<Cell 'Sheet 1'.B100>))
>>> # Get all cells from row 5
>>> sheet[5]
(<Cell 'Sheet 1'.A5>,
<Cell 'Sheet 1'.B5>,
...
<Cell 'Sheet 1'.N5>,
<Cell 'Sheet 1'.O5>)
>>> # Get all cells for a range of rows
>>> sheet[5:6]
((<Cell 'Sheet 1'.A5>,
<Cell 'Sheet 1'.B5>,
...
<Cell 'Sheet 1'.N5>,
<Cell 'Sheet 1'.O5>),
(<Cell 'Sheet 1'.A6>,
<Cell 'Sheet 1'.B6>,
...
<Cell 'Sheet 1'.N6>,
<Cell 'Sheet 1'.O6>))
You'll notice that all of the above examples return a tuple.
If you want to refresh your memory on how to handle tuples in Python, check out the article on Lists and Tuples in Python.
There are also multiple ways of using normal Python generators to go through the data.
The main methods you can use to achieve this are:
.iter_rows()
.iter_cols()
Both methods can receive the following arguments:
min_row
max_row
min_col
max_col
These arguments are used to set boundaries for the iteration:
>>> for row in sheet.iter_rows(min_row=1,
... max_row=2,
... min_col=1,
... max_col=3):
... print(row)
(<Cell 'Sheet 1'.A1>, <Cell 'Sheet 1'.B1>, <Cell 'Sheet 1'.C1>)
(<Cell 'Sheet 1'.A2>, <Cell 'Sheet 1'.B2>, <Cell 'Sheet 1'.C2>)
>>> for column in sheet.iter_cols(min_row=1,
... max_row=2,
... min_col=1,
... max_col=3):
... print(column)
(<Cell 'Sheet 1'.A1>, <Cell 'Sheet 1'.A2>)
(<Cell 'Sheet 1'.B1>, <Cell 'Sheet 1'.B2>)
(<Cell 'Sheet 1'.C1>, <Cell 'Sheet 1'.C2>)
You'll notice that in the first example, when iterating through the rows using .iter_rows(), you get one tuple element per row selected.
While when using .iter_cols() and iterating through columns, you'll get one tuple per column instead.
One additional argument you can pass to both methods is the Boolean values_only.
When it's set to True, the values of the cell are returned, instead of the Cell object:
>>> for value in sheet.iter_rows(min_row=1,
... max_row=2,
... min_col=1,
... max_col=3,
... values_only=True):
... print(value)
('marketplace', 'customer_id', 'review_id')
('US', 3653882, 'R3O9SGZBVQBV76')
If you want to iterate through the whole dataset, then you can also use the attributes .rows or .columns directly, which are shortcuts to using .iter_rows() and .iter_cols() without any arguments:
>>> for row in sheet.rows:
... print(row)
(<Cell 'Sheet 1'.A1>, <Cell 'Sheet 1'.B1>, <Cell 'Sheet 1'.C1>
...
<Cell 'Sheet 1'.M100>, <Cell 'Sheet 1'.N100>, <Cell 'Sheet 1'.O100>)
These shortcuts are very useful when you're iterating through the whole dataset.
Manipulate Data Using Python's Default Data Structures
Now that you know the basics of iterating through the data in a workbook, let's look at smart ways of converting that data into Python structures.
As you saw earlier, the result from all iterations comes in the form of tuples.
However, since a tuple is nothing more than a list that's immutable, you can easily access its data and transform it into other structures.
For example, say you want to extract product information from the sample.xlsx spreadsheet and into a dictionary where each key is a product ID.
A straightforward way to do this is to iterate over all the rows, pick the columns you know are related to product information, and then store that in a dictionary.
Let's code this out!
First of all, have a look at the headers and see what information you care most about:
>>> for value in sheet.iter_rows(min_row=1,
... max_row=1,
... values_only=True):
... print(value)
('marketplace', 'customer_id', 'review_id', 'product_id', ...)
This code prints the first row as a tuple of all the column names you have in the spreadsheet.
To start, grab the columns with names:
product_id
product_parent
product_title
product_category
Lucky for you, the columns you need are all next to each other, so you can use min_col and max_col to easily get the data you want:
>>> for value in sheet.iter_rows(min_row=2,
... min_col=4,
... max_col=7,
... values_only=True):
... print(value)
('B00FALQ1ZC', 937001370, 'Invicta Women\'s 15150 "Angel" 18k Yellow...)
('B00D3RGO20', 484010722, "Kenneth Cole New York Women's KC4944...)
...
Nice! Now that you know how to get all the important product information you need, let's put that data into a dictionary:
import json
from openpyxl import load_workbook

workbook = load_workbook(filename="sample.xlsx")
sheet = workbook.active

products = {}

# Using the values_only because you want to return the cells' values
for row in sheet.iter_rows(min_row=2,
                           min_col=4,
                           max_col=7,
                           values_only=True):
    product_id = row[0]
    product = {
        "parent": row[1],
        "title": row[2],
        "category": row[3]
    }
    products[product_id] = product

# Using json here to be able to format the output for displaying later
print(json.dumps(products))
The code above returns a JSON similar to this:
{
  "B00FALQ1ZC": {
    "parent": 937001370,
    "title": "Invicta Women's 15150 ...",
    "category": "Watches"
  },
  "B00D3RGO20": {
    "parent": 484010722,
    "title": "Kenneth Cole New York ...",
    "category": "Watches"
  }
}
Here you can see that the output is trimmed to 2 products only, but if you run the script as it is, then you should get 98 products.
Convert Data Into Python Classes
To finalize the reading section of this tutorial, let's dive into Python classes and see how you could improve on the example above and better structure the data.
For this, you'll be using the new Python Data Classes that are available from Python 3.7.
If you're using an older version of Python, then you can use the default Classes instead.
So, first things first, let's look at the data you have and decide what you want to store and how you want to store it.
As you saw right at the start, this data comes from Amazon, and it's a list of product reviews.
You can check the list of all the columns and their meaning on Amazon.
There are two significant elements you can extract from the data available:
Products
Reviews
A Product has:
ID
Title
Parent
Category
The Review has a few more fields:
ID
Customer ID
Stars
Headline
Body
Date
You can ignore a few of the review fields to make things a bit simpler.
So, a straightforward implementation of these two classes could be written in a separate file classes.py:
import datetime
from dataclasses import dataclass

@dataclass
class Product:
    id: str
    parent: str
    title: str
    category: str

@dataclass
class Review:
    id: str
    customer_id: str
    stars: int
    headline: str
    body: str
    date: datetime.datetime
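As a quick sanity check of what @dataclass gives you, here is the Product class again with some hypothetical sample values; it is redeclared so the sketch runs standalone, and the decorator generates __init__(), __repr__(), and __eq__() for you:

```python
from dataclasses import dataclass

# Redeclared here so the sketch runs standalone; matches classes.py above.
@dataclass
class Product:
    id: str
    parent: str
    title: str
    category: str

# Hypothetical sample values, just for illustration:
first = Product(id='B00FALQ1ZC', parent='937001370',
                title="Invicta Women's 15150", category='Watches')
second = Product(id='B00FALQ1ZC', parent='937001370',
                 title="Invicta Women's 15150", category='Watches')

print(first)            # the generated repr lists every field
print(first == second)  # True: equality compares field values
```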
After defining your data classes, you need to convert the data from the spreadsheet into these new structures.
Before doing the conversion, it's worth looking at our header again and creating a mapping between columns and the fields you need:
>>> for value in sheet.iter_rows(min_row=1,
...                              max_row=1,
...                              values_only=True):
...     print(value)
('marketplace', 'customer_id', 'review_id', 'product_id', ...)
>>> # Or an alternative
>>> for cell in sheet[1]:
...     print(cell.value)
marketplace
customer_id
review_id
product_id
product_parent
...
Let's create a file mapping.py where you have a list of all the field names and their column location (zero-indexed) on the spreadsheet:
# Product fields
PRODUCT_ID = 3
PRODUCT_PARENT = 4
PRODUCT_TITLE = 5
PRODUCT_CATEGORY = 6
# Review fields
REVIEW_ID = 2
REVIEW_CUSTOMER = 1
REVIEW_STARS = 7
REVIEW_HEADLINE = 12
REVIEW_BODY = 13
REVIEW_DATE = 14
You don't necessarily have to do the mapping above.
It's more for readability when parsing the row data, so you don't end up with a lot of magic numbers lying around.
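If you'd rather not count columns by hand, one possible alternative is to derive the mapping from the header row itself. The sketch below assumes header names matching the sample dataset and uses a small in-memory sheet as a stand-in for sample.xlsx:

```python
from openpyxl import Workbook

# Small in-memory sheet standing in for sample.xlsx (assumed header names)
workbook = Workbook()
sheet = workbook.active
sheet.append(["marketplace", "customer_id", "review_id", "product_id"])

# Build a name -> zero-based column index mapping from the header row
header = next(sheet.iter_rows(min_row=1, max_row=1, values_only=True))
columns = {name: index for index, name in enumerate(header)}

print(columns["product_id"])  # 3, the same value as the hand-written PRODUCT_ID
```

Built this way, the lookups keep working even if the spreadsheet's column order changes.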
Finally, let's look at the code needed to parse the spreadsheet data into a list of product and review objects:
from datetime import datetime
from openpyxl import load_workbook
from classes import Product, Review
from mapping import PRODUCT_ID, PRODUCT_PARENT, PRODUCT_TITLE, \
    PRODUCT_CATEGORY, REVIEW_DATE, REVIEW_ID, REVIEW_CUSTOMER, \
    REVIEW_STARS, REVIEW_HEADLINE, REVIEW_BODY

# Using read_only since you're not going to be editing the spreadsheet
workbook = load_workbook(filename="sample.xlsx", read_only=True)
sheet = workbook.active

products = []
reviews = []

# Using values_only because you just want to return the cell values
for row in sheet.iter_rows(min_row=2, values_only=True):
    product = Product(id=row[PRODUCT_ID],
                      parent=row[PRODUCT_PARENT],
                      title=row[PRODUCT_TITLE],
                      category=row[PRODUCT_CATEGORY])
    products.append(product)

    # You need to parse the date from the spreadsheet into a datetime format
    spread_date = row[REVIEW_DATE]
    parsed_date = datetime.strptime(spread_date, "%Y-%m-%d")

    review = Review(id=row[REVIEW_ID],
                    customer_id=row[REVIEW_CUSTOMER],
                    stars=row[REVIEW_STARS],
                    headline=row[REVIEW_HEADLINE],
                    body=row[REVIEW_BODY],
                    date=parsed_date)
    reviews.append(review)

print(products[0])
print(reviews[0])
After you run the code above, you should get some output like this:
Product(id='B00FALQ1ZC', parent=937001370, ...)
Review(id='R3O9SGZBVQBV76', customer_id=3653882, ...)
That's it! Now you should have the data in a very simple and digestible class format, and you can start thinking of storing this in a Database or any other type of data storage you like.
Using this kind of OOP strategy to parse spreadsheets makes handling the data much simpler later on.
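To see why, here's a minimal sketch of the kind of follow-up analysis plain objects make easy. The Review class here is a stripped-down stand-in for the one defined earlier, keeping only id and stars, with made-up ratings:

```python
from dataclasses import dataclass

# Stripped-down stand-in for the Review class defined earlier
@dataclass
class Review:
    id: str
    stars: int

# With parsed objects, aggregations become ordinary Python
reviews = [Review(id="R1", stars=5), Review(id="R2", stars=3), Review(id="R3", stars=4)]
average_stars = sum(review.stars for review in reviews) / len(reviews)
print(average_stars)  # 4.0
```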
Appending New Data
Before you start creating very complex spreadsheets, have a quick look at an example of how to append data to an existing spreadsheet.
Go back to the first example spreadsheet you created (hello_world.xlsx) and try opening it and appending some data to it, like this:
from openpyxl import load_workbook
# Start by opening the spreadsheet and selecting the main sheet
workbook = load_workbook(filename="hello_world.xlsx")
sheet = workbook.active
# Write what you want into a specific cell
sheet["C1"] = "writing ;)"
# Save the spreadsheet
workbook.save(filename="hello_world_append.xlsx")
Et voilà, if you open the new hello_world_append.xlsx spreadsheet, you'll see the following change:
Notice the additional writing ;) on cell C1.
Writing Excel Spreadsheets With openpyxl
There are a lot of different things you can write to a spreadsheet, from simple text or number values to complex formulas, charts, or even images.
Let's start creating some spreadsheets!
Creating a Simple Spreadsheet
Previously, you saw a very quick example of how to write “Hello world!” into a spreadsheet, so you can start with that:
 1 from openpyxl import Workbook
 2
 3 filename = "hello_world.xlsx"
 4
 5 workbook = Workbook()
 6 sheet = workbook.active
 7
 8 sheet["A1"] = "hello"
 9 sheet["B1"] = "world!"
10
11 workbook.save(filename=filename)
The highlighted lines in the code above are the most important ones for writing.
In the code, you can see that:
Line 5 shows you how to create a new empty workbook.
Lines 8 and 9 show you how to add data to specific cells.
Line 11 shows you how to save the spreadsheet when you're done.
Even though these lines are straightforward, it's still good to know them well for when things get a bit more complicated.
Note: You'll be using the hello_world.xlsx spreadsheet for some of the upcoming examples, so keep it handy.
One thing you can do to help with coming code examples is add the following method to your Python file or console:
>>> def print_rows():
...     for row in sheet.iter_rows(values_only=True):
...         print(row)
It makes it easier to print all of your spreadsheet values by just calling print_rows().
Basic Spreadsheet Operations
Before you get into the more advanced topics, it's good for you to know how to manage the most simple elements of a spreadsheet.
Adding and Updating Cell Values
You already learned how to add values to a spreadsheet like this:
>>> sheet["A1"] = "value"
There's another way you can do this, by first selecting a cell and then changing its value:
>>> cell = sheet["A1"]
>>> cell
<Cell 'Sheet'.A1>
>>> cell.value
'hello'
>>> cell.value = "hey"
>>> cell.value
'hey'
The new value is only stored in the spreadsheet file once you call workbook.save().
openpyxl creates a cell on the fly when you add a value to one that didn't exist before:
>>> # Before, our spreadsheet has only 1 row
>>> print_rows()
('hello', 'world!')
>>> # Try adding a value to row 10
>>> sheet["B10"] = "test"
>>> print_rows()
('hello', 'world!')
(None, None)
(None, None)
(None, None)
(None, None)
(None, None)
(None, None)
(None, None)
(None, None)
(None, 'test')
As you can see, after adding a value to cell B10, iterating yields 10 rows of tuples, with None for every empty cell, just so that test value can live in row 10.
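If those padding rows get in the way later, you can filter them out when iterating. The sketch below rebuilds the example in memory and keeps only rows that contain at least one value:

```python
from openpyxl import Workbook

# Rebuild the example sheet in memory
workbook = Workbook()
sheet = workbook.active
sheet["A1"] = "hello"
sheet["B1"] = "world!"
sheet["B10"] = "test"

# Keep only rows that have at least one non-empty cell
non_empty = [row for row in sheet.iter_rows(values_only=True)
             if any(cell is not None for cell in row)]
print(non_empty)  # [('hello', 'world!'), (None, 'test')]
```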
Managing Rows and Columns
One of the most common things you have to do when manipulating spreadsheets is adding or removing rows and columns.
The openpyxl package allows you to do that in a very straightforward way by using the methods:
.insert_rows()
.delete_rows()
.insert_cols()
.delete_cols()
Every single one of those methods can receive two arguments:
idx
amount
Using our basic hello_world.xlsx example again, let's see how these methods work:
>>> print_rows()
('hello', 'world!')
>>> # Insert a column before the existing column 1 ("A")
>>> sheet.insert_cols(idx=1)
>>> print_rows()
(None, 'hello', 'world!')
>>> # Insert 5 columns between column 2 ("B") and 3 ("C")
>>> sheet.insert_cols(idx=3, amount=5)
>>> print_rows()
(None, 'hello', None, None, None, None, None, 'world!')
>>> # Delete the created columns
>>> sheet.delete_cols(idx=3, amount=5)
>>> sheet.delete_cols(idx=1)
>>> print_rows()
('hello', 'world!')
>>> # Insert a new row in the beginning
>>> sheet.insert_rows(idx=1)
>>> print_rows()
(None, None)
('hello', 'world!')
>>> # Insert 3 new rows in the beginning
>>> sheet.insert_rows(idx=1, amount=3)
>>> print_rows()
(None, None)
(None, None)
(None, None)
(None, None)
('hello', 'world!')
>>> # Delete the first 4 rows
>>> sheet.delete_rows(idx=1, amount=4)
>>> print_rows()
('hello', 'world!')
The only thing you need to remember is that when inserting new data (rows or columns), the insertion happens before the idx parameter.
So, if you do insert_rows(1), it inserts a new row before the existing first row.
It's the same for columns: when you call insert_cols(2), it inserts a new column right before the already existing second column (B).
However, when deleting rows or columns, .delete_... deletes data starting from the index passed as an argument.
For example, when doing delete_rows(2) it deletes row 2, and when doing delete_cols(3) it deletes the third column (C).
Managing Sheets
Sheet management is also one of those things you might need to know, even though it might be something that you don't use that often.
If you look back at the code examples from this tutorial, you'll notice the following recurring piece of code:
sheet = workbook.active
This is the way to select the default sheet from a spreadsheet.
However, if you're opening a spreadsheet with multiple sheets, then you can always select a specific one like this:
>>> # Let's say you have two sheets: "Products" and "Company Sales"
>>> workbook.sheetnames
['Products', 'Company Sales']
>>> # You can select a sheet using its title
>>> products_sheet = workbook["Products"]
>>> sales_sheet = workbook["Company Sales"]
You can also change a sheet title very easily:
>>> workbook.sheetnames
['Products', 'Company Sales']
>>> products_sheet = workbook["Products"]
>>> products_sheet.title = "New Products"
>>> workbook.sheetnames
['New Products', 'Company Sales']
If you want to create or delete sheets, then you can also do that with .create_sheet() and .remove():
>>> workbook.sheetnames
['Products', 'Company Sales']
>>> operations_sheet = workbook.create_sheet("Operations")
>>> workbook.sheetnames
['Products', 'Company Sales', 'Operations']
>>> # You can also define the position to create the sheet at
>>> hr_sheet = workbook.create_sheet("HR", 0)
>>> workbook.sheetnames
['HR', 'Products', 'Company Sales', 'Operations']
>>> # To remove them, just pass the sheet as an argument to the .remove()
>>> workbook.remove(operations_sheet)
>>> workbook.sheetnames
['HR', 'Products', 'Company Sales']
>>> workbook.remove(hr_sheet)
>>> workbook.sheetnames
['Products', 'Company Sales']
One other thing you can do is make duplicates of a sheet using copy_worksheet():
>>> workbook.sheetnames
['Products', 'Company Sales']
>>> products_sheet = workbook["Products"]
>>> workbook.copy_worksheet(products_sheet)
<Worksheet "Products Copy">
>>> workbook.sheetnames
['Products', 'Company Sales', 'Products Copy']
If you open your spreadsheet after saving the above code, you'll notice that the sheet Products Copy is a duplicate of the sheet Products.
Freezing Rows and Columns
Something that you might want to do when working with big spreadsheets is to freeze a few rows or columns, so they remain visible when you scroll right or down.
Freezing data allows you to keep an eye on important rows or columns, regardless of where you scroll in the spreadsheet.
Again, openpyxl also has a way to accomplish this by using the worksheet freeze_panes attribute.
For this example, go back to our sample.xlsx spreadsheet and try doing the following:
>>> workbook = load_workbook(filename="sample.xlsx")
>>> sheet = workbook.active
>>> sheet.freeze_panes = "C2"
>>> workbook.save("sample_frozen.xlsx")
If you open the sample_frozen.xlsx spreadsheet in your favorite spreadsheet editor, you'll notice that row 1 and columns A and B are frozen and are always visible no matter where you navigate within the spreadsheet.
This feature is handy, for example, to keep headers within sight, so you always know what each column represents.
Here's how it looks in the editor:
Notice how you're at the end of the spreadsheet, and yet, you can see both row 1 and columns A and B.
Adding Filters
You can use openpyxl to add filters and sorts to your spreadsheet.
However, when you open the spreadsheet, the data won't be rearranged according to these sorts and filters.
At first, this might seem like a pretty useless feature, but when you're programmatically creating a spreadsheet that is going to be sent and used by somebody else, it's still nice to at least create the filters and allow people to use it afterward.
The code below is an example of how you would add some filters to our existing sample.xlsx spreadsheet:
>>> # Check the used spreadsheet space using the attribute "dimensions"
>>> sheet.dimensions
'A1:O100'
>>> sheet.auto_filter.ref = "A1:O100"
>>> workbook.save(filename="sample_with_filters.xlsx")
You should now see the filters created when opening the spreadsheet in your editor:
You don't have to use sheet.dimensions if you know precisely which part of the spreadsheet you want to apply filters to.
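If you'd rather not hard-code the range at all, you can also compute it from the sheet's own bounds with max_row, max_column, and get_column_letter. A sketch with a couple of made-up rows:

```python
from openpyxl import Workbook
from openpyxl.utils import get_column_letter

workbook = Workbook()
sheet = workbook.active
sheet.append(["review_id", "stars", "headline"])  # hypothetical header row
sheet.append(["R1", 5, "Great watch"])

# Build the filter range from the sheet's bounds instead of a literal string
last_column = get_column_letter(sheet.max_column)
sheet.auto_filter.ref = f"A1:{last_column}{sheet.max_row}"
print(sheet.auto_filter.ref)  # A1:C2
```

This keeps the filter covering exactly the used area even as rows are appended.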
Adding Formulas
Formulas (or formulae) are one of the most powerful features of spreadsheets.
They give you the power to apply specific mathematical equations to a range of cells.
Using formulas with openpyxl is as simple as editing the value of a cell.
You can see the list of formulas supported by openpyxl:
>>> from openpyxl.utils import FORMULAE
>>> FORMULAE
frozenset({'ABS',
'ACCRINT',
'ACCRINTM',
'ACOS',
'ACOSH',
'AMORDEGRC',
'AMORLINC',
'AND',
...
'YEARFRAC',
'YIELD',
'YIELDDISC',
'YIELDMAT',
'ZTEST'})
Let's add some formulas to our sample.xlsx spreadsheet.
Starting with something easy, let's check the average star rating for the 99 reviews within the spreadsheet:
>>> # Star rating is column "H"
>>> sheet["P2"] = "=AVERAGE(H2:H100)"
>>> workbook.save(filename="sample_formulas.xlsx")
If you open the spreadsheet now and go to cell P2, you should see that its value is: 4.18181818181818.
Have a look in the editor:
You can use the same methodology to add any formulas to your spreadsheet.
For example, let's count the number of reviews that had helpful votes:
>>> # The helpful votes are counted on column "I"
>>> sheet["P3"] = '=COUNTIF(I2:I100, ">0")'
>>> workbook.save(filename="sample_formulas.xlsx")
You should get the number 21 on your P3 spreadsheet cell like so:
You'll have to make sure that the strings within a formula are always in double quotes, so you either have to use single quotes around the formula like in the example above or you'll have to escape the double quotes inside the formula: "=COUNTIF(I2:I100, \">0\")".
There are a ton of other formulas you can add to your spreadsheet using the same procedure you tried above.
Give it a go yourself!
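One thing to keep in mind: openpyxl writes formulas as plain strings and doesn't validate them, so a typo only shows up when you open the file. Since FORMULAE is just a frozenset of names, a small guard like the hypothetical checked_formula helper below can catch typos early:

```python
from openpyxl.utils import FORMULAE

def checked_formula(formula: str) -> str:
    # Pull the function name out of something like "=SUM(A1:A10)"
    name = formula.lstrip("=").split("(")[0]
    if name not in FORMULAE:
        raise ValueError(f"Unknown formula: {name}")
    return formula

print(checked_formula("=SUM(A1:A10)"))  # =SUM(A1:A10)
```

A misspelled name such as "=SUMM(A1:A10)" would raise a ValueError before anything is written to the spreadsheet.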
Adding Styles
Even though styling a spreadsheet might not be something you would do every day, it's still good to know how to do it.
Using openpyxl, you can apply multiple styling options to your spreadsheet, including fonts, borders, colors, and so on.
Have a look at the openpyxl documentation to learn more.
You can also choose to either apply a style directly to a cell or create a template and reuse it to apply styles to multiple cells.
Let's start by having a look at simple cell styling, using our sample.xlsx again as the base spreadsheet:
>>> # Import necessary style classes
>>> from openpyxl.styles import Font, Color, Alignment, Border, Side
>>> # Create a few styles
>>> bold_font = Font(bold=True)
>>> big_red_text = Font(color="00FF0000", size=20)
>>> center_aligned_text = Alignment(horizontal="center")
>>> double_border_side = Side(border_style="double")
>>> square_border = Border(top=double_border_side,
...                        right=double_border_side,
...                        bottom=double_border_side,
...                        left=double_border_side)
>>> # Style some cells!
>>> sheet["A2"].font = bold_font
>>> sheet["A3"].font = big_red_text
>>> sheet["A4"].alignment = center_aligned_text
>>> sheet["A5"].border = square_border
>>> workbook.save(filename="sample_styles.xlsx")
If you open your spreadsheet now, you should see quite a few different styles on the first 5 cells of column A:
There you go.
You got:
A2 with the text in bold
A3 with the text in red and bigger font size
A4 with the text centered
A5 with a square border around the text
Note: For the colors, you can also use HEX codes instead by doing Font(color="C70E0F").
You can also combine styles by simply adding them to the cell at the same time:
>>> # Reusing the same styles from the example above
>>> sheet["A6"].alignment = center_aligned_text
>>> sheet["A6"].font = big_red_text
>>> sheet["A6"].border = square_border
>>> workbook.save(filename="sample_styles.xlsx")
Have a look at cell A6 here:
When you want to apply multiple styles to one or several cells, you can use a NamedStyle class instead, which is like a style template that you can use over and over again.
Have a look at the example below:
>>> from openpyxl.styles import NamedStyle
>>> # Let's create a style template for the header row
>>> header = NamedStyle(name="header")
>>> header.font = Font(bold=True)
>>> header.border = Border(bottom=Side(border_style="thin"))
>>> header.alignment = Alignment(horizontal="center", vertical="center")
>>> # Now let's apply this to all first row (header) cells
>>> header_row = sheet[1]
>>> for cell in header_row:
...     cell.style = header
>>> workbook.save(filename="sample_styles.xlsx")
If you open the spreadsheet now, you should see that its first row is bold, the text is aligned to the center, and there's a small bottom border! Have a look below:
As you saw above, there are many options when it comes to styling, and it depends on the use case, so feel free to check openpyxl documentation and see what other things you can do.
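One more NamedStyle detail worth knowing: if you register the style on the workbook first with add_named_style(), you can then apply it anywhere by its name string, which keeps all the style definitions in one place. A minimal sketch:

```python
from openpyxl import Workbook
from openpyxl.styles import Font, NamedStyle

workbook = Workbook()
sheet = workbook.active
sheet["A1"] = "Product"

# Register the style once on the workbook...
header = NamedStyle(name="header", font=Font(bold=True))
workbook.add_named_style(header)

# ...then apply it anywhere by name
sheet["A1"].style = "header"
print(workbook.named_styles)  # ['Normal', 'header']
```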
Conditional Formatting
This feature is one of my personal favorites when it comes to adding styles to a spreadsheet.
It's a much more powerful approach to styling because it dynamically applies styles according to how the data in the spreadsheet changes.
In a nutshell, conditional formatting allows you to specify a list of styles to apply to a cell (or cell range) according to specific conditions.
For example, a widespread use case is to have a balance sheet where all the negative totals are in red, and the positive ones are in green.
This formatting makes it much more efficient to spot good vs bad periods.
Without further ado, let's pick our favorite spreadsheet, sample.xlsx, and add some conditional formatting.
You can start by adding a simple one that adds a red background to all reviews with less than 3 stars:
>>> from openpyxl.styles import PatternFill
>>> from openpyxl.styles.differential import DifferentialStyle
>>> from openpyxl.formatting.rule import Rule
>>> red_background = PatternFill(fgColor="00FF0000")
>>> diff_style = DifferentialStyle(fill=red_background)
>>> rule = Rule(type="expression", dxf=diff_style)
>>> rule.formula = ["$H1<3"]
>>> sheet.conditional_formatting.add("A1:O100", rule)
>>> workbook.save("sample_conditional_formatting.xlsx")
Now you'll see all the reviews with a star rating below 3 marked with a red background:
Code-wise, the only things that are new here are the objects DifferentialStyle and Rule:
DifferentialStyle is quite similar to NamedStyle, which you already saw above, and it's used to aggregate multiple styles such as fonts, borders, alignment, and so forth.
Rule is responsible for selecting the cells and applying the styles if the cells match the rule's logic.
Using a Rule object, you can create numerous conditional formatting scenarios.
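For instance, besides the "expression" type used above, openpyxl also ships helpers such as CellIsRule for plain value comparisons. A sketch that would flag sub-3-star values, using a few made-up ratings in column A:

```python
from openpyxl import Workbook
from openpyxl.formatting.rule import CellIsRule
from openpyxl.styles import PatternFill

workbook = Workbook()
sheet = workbook.active
for stars in (1, 4, 2, 5):  # made-up ratings
    sheet.append([stars])

# CellIsRule covers the common "compare each cell against a value" case
red_fill = PatternFill(start_color="00FF0000", end_color="00FF0000", fill_type="solid")
rule = CellIsRule(operator="lessThan", formula=["3"], fill=red_fill)
sheet.conditional_formatting.add("A1:A4", rule)
```

Saved and opened in an editor, the cells containing 1 and 2 would show the red fill.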
However, for simplicity's sake, the openpyxl package offers 3 built-in formats that make it easier to create a few common conditional formatting patterns.
These built-ins are:
ColorScale
IconSet
DataBar
The ColorScale gives you the ability to create color gradients:
>>> from openpyxl.formatting.rule import ColorScaleRule
>>> color_scale_rule = ColorScaleRule(start_type="min",
...                                   start_color="00FF0000",  # Red
...                                   end_type="max",
...                                   end_color="0000FF00")  # Green
>>> # Again, let's add this gradient to the star ratings, column "H"
>>> sheet.conditional_formatting.add("H2:H100", color_scale_rule)
>>> workbook.save(filename="sample_conditional_formatting_color_scale.xlsx")
Now you should see a color gradient on column H, from red to green, according to the star rating:
You can also add a third color and make two gradients instead:
>>> from openpyxl.formatting.rule import ColorScaleRule
>>> color_scale_rule = ColorScaleRule(start_type="num",
...                                   start_value=1,
...                                   start_color="00FF0000",  # Red
...                                   mid_type="num",
...                                   mid_value=3,
...                                   mid_color="00FFFF00",  # Yellow
...                                   end_type="num",
...                                   end_value=5,
...                                   end_color="0000FF00")  # Green
>>> # Again, let's add this gradient to the star ratings, column "H"
>>> sheet.conditional_formatting.add("H2:H100", color_scale_rule)
>>> workbook.save(filename="sample_conditional_formatting_color_scale_3.xlsx")
This time, you'll notice that star ratings between 1 and 3 have a gradient from red to yellow, and star ratings between 3 and 5 have a gradient from yellow to green:
The IconSet allows you to add an icon to the cell according to its value:
>>> from openpyxl.formatting.rule import IconSetRule
>>> icon_set_rule = IconSetRule("5Arrows", "num", [1, 2, 3, 4, 5])
>>> sheet.conditional_formatting.add("H2:H100", icon_set_rule)
>>> workbook.save("sample_conditional_formatting_icon_set.xlsx")
You'll see a colored arrow next to the star rating.
This arrow is red and points down when the value of the cell is 1 and, as the rating gets better, the arrow starts pointing up and becomes green:
The openpyxl package has a full list of other icons you can use, besides the arrow.
Finally, the DataBar allows you to create progress bars:
>>> from openpyxl.formatting.rule import DataBarRule
>>> data_bar_rule = DataBarRule(start_type="num",
...                             start_value=1,
...                             end_type="num",
...                             end_value=5,
...                             color="0000FF00")  # Green
>>> sheet.conditional_formatting.add("H2:H100", data_bar_rule)
>>> workbook.save("sample_conditional_formatting_data_bar.xlsx")
You'll now see a green progress bar that gets fuller the closer the star rating is to the number 5:
As you can see, there are a lot of cool things you can do with conditional formatting.
Here, you saw only a few examples of what you can achieve with it, but check the openpyxl documentation to see a bunch of other options.
Adding Images
Even though images are not something that you'll often see in a spreadsheet, it's quite cool to be able to add them.
Maybe you can use it for branding purposes or to make spreadsheets more personal.
To be able to load images to a spreadsheet using openpyxl, you'll have to install Pillow:
$ pip install Pillow
Apart from that, you'll also need an image.
For this example, you can grab the Real Python logo below and convert it from .webp to .png using an online converter such as cloudconvert.com, save the final file as logo.png, and copy it to the root folder where you're running your examples:
Afterward, this is the code you need to import that image into the hello_world.xlsx spreadsheet:
from openpyxl import load_workbook
from openpyxl.drawing.image import Image
# Let's use the hello_world spreadsheet since it has less data
workbook = load_workbook(filename="hello_world.xlsx")
sheet = workbook.active
logo = Image("logo.png")
# A bit of resizing to not fill the whole spreadsheet with the logo
logo.height = 150
logo.width = 150
sheet.add_image(logo, "A3")
workbook.save(filename="hello_world_logo.xlsx")
You have an image on your spreadsheet! Here it is:
The image's left top corner is on the cell you chose, in this case, A3.
Adding Pretty Charts
Another powerful thing you can do with spreadsheets is create an incredible variety of charts.
Charts are a great way to visualize and understand loads of data quickly.
There are a lot of different chart types: bar chart, pie chart, line chart, and so on.
openpyxl has support for a lot of them.
Here, you'll see only a couple of examples of charts because the theory behind it is the same for every single chart type:
Note: A few of the chart types that openpyxl currently doesn't have support for are Funnel, Gantt, Pareto, Treemap, Waterfall, Map, and Sunburst.
For any chart you want to build, you'll need to define the chart type: BarChart, LineChart, and so forth, plus the data to be used for the chart, which is called Reference.
Before you can build your chart, you need to define what data you want to see represented in it.
Sometimes, you can use the dataset as is, but other times you need to massage the data a bit to get additional information.
Let's start by building a new workbook with some sample data:
from openpyxl import Workbook
from openpyxl.chart import BarChart, Reference

workbook = Workbook()
sheet = workbook.active

# Let's create some sample sales data
rows = [
    ["Product", "Online", "Store"],
    [1, 30, 45],
    [2, 40, 30],
    [3, 40, 25],
    [4, 50, 30],
    [5, 30, 25],
    [6, 25, 35],
    [7, 20, 40],
]

for row in rows:
    sheet.append(row)
Now you're going to start by creating a bar chart that displays the total number of sales per product:
chart = BarChart()
data = Reference(worksheet=sheet,
                 min_row=1,
                 max_row=8,
                 min_col=2,
                 max_col=3)

chart.add_data(data, titles_from_data=True)
sheet.add_chart(chart, "E2")

workbook.save("chart.xlsx")
There you have it.
Below, you can see a very straightforward bar chart showing the difference between online and in-store product sales:
Like with images, the top left corner of the chart is on the cell you added the chart to.
In your case, it was on cell E2.
Note: Depending on whether you're using Microsoft Excel or an open-source alternative (LibreOffice or OpenOffice), the chart might look slightly different.
Try creating a line chart instead, changing the data a bit:
import random
from openpyxl import Workbook
from openpyxl.chart import LineChart, Reference

workbook = Workbook()
sheet = workbook.active

# Let's create some sample sales data
rows = [
    ["", "January", "February", "March", "April",
     "May", "June", "July", "August", "September",
     "October", "November", "December"],
    [1, ],
    [2, ],
    [3, ],
]

for row in rows:
    sheet.append(row)

for row in sheet.iter_rows(min_row=2,
                           max_row=4,
                           min_col=2,
                           max_col=13):
    for cell in row:
        cell.value = random.randrange(5, 100)
With the above code, you'll be able to generate some random data regarding the sales of 3 different products across a whole year.
Once that's done, you can very easily create a line chart with the following code:
chart = LineChart()
data = Reference(worksheet=sheet,
                 min_row=2,
                 max_row=4,
                 min_col=1,
                 max_col=13)

chart.add_data(data, from_rows=True, titles_from_data=True)
sheet.add_chart(chart, "C6")

workbook.save("line_chart.xlsx")
Here's the outcome of the above piece of code:
One thing to keep in mind here is the fact that you're using from_rows=True when adding the data.
This argument makes the chart plot row by row instead of column by column.
In your sample data, you see that each product has a row with 12 values (1 column per month).
That's why you use from_rows.
If you don't pass that argument, by default, the chart tries to plot by column, and you'll get a month-by-month comparison of sales.
Another difference that has to do with the above argument change is the fact that our Reference now starts from the first column, min_col=1, instead of the second one.
This change is needed because the chart now expects the first column to have the titles.
There are a couple of other things you can also change regarding the style of the chart.
For example, you can add specific categories to the chart:
cats = Reference(worksheet=sheet,
                 min_row=1,
                 max_row=1,
                 min_col=2,
                 max_col=13)
chart.set_categories(cats)
Add this piece of code before saving the workbook, and you should see the month names appearing instead of numbers:
Code-wise, this is a minimal change.
But in terms of the readability of the spreadsheet, this makes it much easier for someone to open the spreadsheet and understand the chart straight away.
Another thing you can do to improve the chart readability is to add an axis.
You can do it using the attributes x_axis and y_axis:
chart.x_axis.title = "Months"
chart.y_axis.title = "Sales (per unit)"
This will generate a spreadsheet like the below one:
As you can see, small changes like the above make reading your chart a much easier and quicker task.
There is also a way to style your chart by using Excel's default ChartStyle property.
In this case, you have to choose a number between 1 and 48.
Depending on your choice, the colors of your chart change as well:
# You can play with this by choosing any number between 1 and 48
chart.style = 24
With the style selected above, all lines have some shade of orange:
There is no clear documentation on what each style number looks like, but this spreadsheet has a few examples of the styles available.
Here's the full code used to generate the line chart with categories, axis titles, and style:
import random
from openpyxl import Workbook
from openpyxl.chart import LineChart, Reference

workbook = Workbook()
sheet = workbook.active

# Let's create some sample sales data
rows = [
    ["", "January", "February", "March", "April",
     "May", "June", "July", "August", "September",
     "October", "November", "December"],
    [1, ],
    [2, ],
    [3, ],
]

for row in rows:
    sheet.append(row)

for row in sheet.iter_rows(min_row=2,
                           max_row=4,
                           min_col=2,
                           max_col=13):
    for cell in row:
        cell.value = random.randrange(5, 100)

# Create a LineChart and add the main data
chart = LineChart()
data = Reference(worksheet=sheet,
                 min_row=2,
                 max_row=4,
                 min_col=1,
                 max_col=13)
chart.add_data(data, titles_from_data=True, from_rows=True)

# Add categories to the chart
cats = Reference(worksheet=sheet,
                 min_row=1,
                 max_row=1,
                 min_col=2,
                 max_col=13)
chart.set_categories(cats)

# Rename the X and Y Axis
chart.x_axis.title = "Months"
chart.y_axis.title = "Sales (per unit)"

# Apply a specific Style
chart.style = 24

# Save!
sheet.add_chart(chart, "C6")
workbook.save("line_chart.xlsx")
There are a lot more chart types and customization you can apply, so be sure to check out the package documentation on this if you need some specific formatting.
Convert Python Classes to Excel Spreadsheet
You already saw how to convert an Excel spreadsheet's data into Python classes, but now let's do the opposite.
Let's imagine you have a database and are using some Object-Relational Mapping (ORM) to map DB objects into Python classes.
Now, you want to export those same objects into a spreadsheet.
Let's assume the following data classes to represent the data coming from your database regarding product sales:
from dataclasses import dataclass
from typing import List

@dataclass
class Sale:
    quantity: int

@dataclass
class Product:
    id: str
    name: str
    sales: List[Sale]
Now, let's generate some random data, assuming the above classes are stored in a db_classes.py file:
import random

# Ignore these for now. You'll use them in a sec ;)
from openpyxl import Workbook
from openpyxl.chart import LineChart, Reference

from db_classes import Product, Sale

products = []

# Let's create 5 products
for idx in range(1, 6):
    sales = []

    # Create 5 months of sales
    for _ in range(5):
        sale = Sale(quantity=random.randrange(5, 100))
        sales.append(sale)

    product = Product(id=str(idx),
                      name="Product %s" % idx,
                      sales=sales)
    products.append(product)
By running this piece of code, you should get 5 products with 5 months of sales with a random quantity of sales for each month.
Now, to convert this into a spreadsheet, you need to iterate over the data and append it to the spreadsheet:
workbook = Workbook()
sheet = workbook.active

# Append column names first
sheet.append(["Product ID", "Product Name", "Month 1",
              "Month 2", "Month 3", "Month 4", "Month 5"])

# Append the data
for product in products:
    data = [product.id, product.name]
    for sale in product.sales:
        data.append(sale.quantity)
    sheet.append(data)
That's it.
That should allow you to create a spreadsheet with some data coming from your database.
However, why not use some of that cool knowledge you gained recently to add a chart as well to display that data more visually?
All right, then you could probably do something like this:
chart = LineChart()
data = Reference(worksheet=sheet,
                 min_row=2,
                 max_row=6,
                 min_col=2,
                 max_col=7)

chart.add_data(data, titles_from_data=True, from_rows=True)
sheet.add_chart(chart, "B8")

cats = Reference(worksheet=sheet,
                 min_row=1,
                 max_row=1,
                 min_col=3,
                 max_col=7)
chart.set_categories(cats)

chart.x_axis.title = "Months"
chart.y_axis.title = "Sales (per unit)"

workbook.save(filename="oop_sample.xlsx")
Now we're talking! Here's a spreadsheet generated from database objects and with a chart and everything:
That's a great way for you to wrap up your new knowledge of charts!
Bonus: Working With Pandas
Even though you can use Pandas to handle Excel files, there are a few things that you either can't accomplish with Pandas or that you'd be better off doing with openpyxl directly.
For example, some of the advantages of using openpyxl are the ability to easily customize your spreadsheet with styles, conditional formatting, and such.
But you don't have to worry about picking one.
In fact, openpyxl supports both converting data from a Pandas DataFrame into a workbook and the opposite, converting an openpyxl workbook into a Pandas DataFrame.
Note: If you're new to Pandas, check our course on Pandas DataFrames beforehand.
First things first, remember to install the pandas package:
$ pip install pandas
Then, let's create a sample DataFrame:
import pandas as pd

data = {
    "Product Name": ["Product 1", "Product 2"],
    "Sales Month 1": [10, 20],
    "Sales Month 2": [5, 35],
}
df = pd.DataFrame(data)
Now that you have some data, you can use dataframe_to_rows() to convert it from a DataFrame into a worksheet:
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

workbook = Workbook()
sheet = workbook.active

for row in dataframe_to_rows(df, index=False, header=True):
    sheet.append(row)

workbook.save("pandas.xlsx")
You should see a spreadsheet that looks like this:
If you want to include the DataFrame's index, pass index=True instead, and each row's index is added to your spreadsheet.
On the other hand, if you want to convert a spreadsheet into a DataFrame, you can also do it in a very straightforward way like so:
import pandas as pd
from openpyxl import load_workbook
workbook = load_workbook(filename="sample.xlsx")
sheet = workbook.active
values = sheet.values
df = pd.DataFrame(values)
Alternatively, if you want to add the correct headers and use the review ID as the index, for example, then you can also do it like this instead:
import pandas as pd
from openpyxl import load_workbook
from mapping import REVIEW_ID
workbook = load_workbook(filename="sample.xlsx")
sheet = workbook.active
data = sheet.values
# Set the first row as the columns for the DataFrame
cols = next(data)
data = list(data)
# Set the field "review_id" as the indexes for each row
idx = [row[REVIEW_ID] for row in data]
df = pd.DataFrame(data, index=idx, columns=cols)
Using indexes and columns allows you to access data from your DataFrame easily:
>>> df.columns
Index(['marketplace', 'customer_id', 'review_id', 'product_id',
'product_parent', 'product_title', 'product_category', 'star_rating',
'helpful_votes', 'total_votes', 'vine', 'verified_purchase',
'review_headline', 'review_body', 'review_date'],
dtype='object')
>>> # Get first 10 reviews' star rating
>>> df["star_rating"][:10]
R3O9SGZBVQBV76 5
RKH8BNC3L5DLF 5
R2HLE8WKZSU3NL 2
R31U3UH5AZ42LL 5
R2SV659OUJ945Y 4
RA51CP8TR5A2L 5
RB2Q7DLDN6TH6 5
R2RHFJV0UYBK3Y 1
R2Z6JOQ94LFHEP 5
RX27XIIWY5JPB 4
Name: star_rating, dtype: int64
>>> # Grab review with id "R2EQL1V1L6E0C9", using the index
>>> df.loc["R2EQL1V1L6E0C9"]
marketplace US
customer_id 15305006
review_id R2EQL1V1L6E0C9
product_id B004LURNO6
product_parent 892860326
review_headline Five Stars
review_body Love it
review_date 2015-08-31
Name: R2EQL1V1L6E0C9, dtype: object
There you go, whether you want to use openpyxl to prettify your Pandas dataset or use Pandas to do some hardcore algebra, you now know how to switch between both packages.
Running a Python script from HTML
You can run a Python file from an HTML page by using PHP.
Add a PHP file named index.php:
<html>
<head>
<title>Run my Python files</title>
</head>
<body>
<?php
echo shell_exec("python test.py 'parameter1'");
?>
</body>
</html>
Passing the parameter to Python
Create a Python file as test.py:
import sys
# Use a name other than "input" to avoid shadowing the built-in
param = sys.argv[1]
print(param)
This prints the parameter passed in by PHP.
OHLC Charts in Python
import plotly.graph_objects as go
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv')
fig = go.Figure(data=go.Ohlc(x=df['Date'],
open=df['AAPL.Open'],
high=df['AAPL.High'],
low=df['AAPL.Low'],
close=df['AAPL.Close']))
fig.show()
two independent programs to communicate with each other
The best way for two independent programs to communicate with each other depends on the specific use case and requirements of the programs.
Both reading and writing to a file and using a local TCP connection are common methods for inter-process communication.
Reading and writing to a file can be a simple and effective way to share data between programs.
However, it may not be the best option for real-time communication or when large amounts of data need to be exchanged frequently.
Using a local TCP connection can provide more real-time communication and can handle larger amounts of data.
However, it requires more setup and configuration, and may not be necessary for simpler communication needs.
Both methods are commonly used in inter-process communication.
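As a quick sketch of the file-based approach (the file name and JSON message format here are just illustrative choices, not part of any standard):

```python
import json
import os
import tempfile

# A minimal sketch of file-based inter-process communication:
# one program writes a message file, the other reads it.

def write_message(path, payload):
    # Write to a temporary file first, then rename it into place, so a
    # reader never sees a half-written file (os.replace is atomic on
    # the same filesystem).
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp, path)

def read_message(path):
    # Return the latest message, or None if nothing has been written yet.
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)

write_message("shared_message.json", {"status": "ready", "count": 3})
print(read_message("shared_message.json"))  # {'status': 'ready', 'count': 3}
```

The atomic-rename trick is what keeps this simple scheme safe when the reader polls the file while the writer is updating it.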
To use a local TCP connection for communication between two independent programs, you need to follow these general steps:
Establish a TCP server in one program:
Choose one of the programs to act as the server that will listen for incoming connections.
Create a TCP socket in the server program and bind it to a specific port.
The port number can be any available port that is not already in use.
Here's an example of how to set up a TCP server in Python:
import socket
# Create a TCP socket
server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Bind the socket to a specific address and port
server_address = ('localhost', 5000) # Replace 'localhost' with the server's IP address if needed
server_socket.bind(server_address)
# Listen for incoming connections
server_socket.listen(1)
# Accept a client connection
client_socket, client_address = server_socket.accept()
# Now the server is ready to communicate with the client
Connect the TCP client to the server:
In the other program, create a TCP socket and connect it to the server's IP address and port.
Once the connection is established, the client program can send and receive data to/from the server.
Here's an example of how the client program can send and receive data to/from the server using the local TCP connection:
import socket
# Create a TCP socket
client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Connect to the server
server_address = ('localhost', 5000) # Replace 'localhost' with the server's IP address if needed
client_socket.connect(server_address)
# Send data to the server
data_to_send = "Hello, server!"
client_socket.sendall(data_to_send.encode())
# Receive data from the server
received_data = client_socket.recv(1024).decode()
print("Received data from server:", received_data)
# Close the connection
client_socket.close()
In this example, the client program creates a TCP socket, connects to the server's IP address and port, and sends data to the server using the sendall() method after encoding the data as bytes.
It then waits to receive a response from the server using the recv() method, specifying the maximum number of bytes to receive (1024 in this case).
The received data is decoded from bytes to a string and printed.
On the server side, you can use a similar approach to receive data from the client and send a response back.
Remember to replace 'localhost' with the appropriate IP address if the server is running on a different machine.
Additionally, you can add exception handling to gracefully handle errors during the connection and communication process.
Here's an example of how the server can receive data from the client and send a response back using the local TCP connection:
import socket
# Create a TCP socket
server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Bind the socket to a specific address and port
server_address = ('localhost', 5000) # Replace 'localhost' with the server's IP address if needed
server_socket.bind(server_address)
# Listen for incoming connections
server_socket.listen(1)
# Accept a client connection
client_socket, client_address = server_socket.accept()
# Receive data from the client
received_data = client_socket.recv(1024).decode()
print("Received data from client:", received_data)
# Process the received data (e.g., perform calculations, generate a response)
# Send a response back to the client
response_data = "Hello, client!"
client_socket.sendall(response_data.encode())
# Close the connection
client_socket.close()
server_socket.close()
In this example, after accepting the client connection, the server program waits to receive data from the client using the recv() method, specifying the maximum number of bytes to receive (1024 in this case).
The received data is then decoded from bytes to a string and processed as needed. In this case, we simply generate a response message.
After processing the data and generating a response, the server uses the sendall() method to send the response back to the client.
The response data is encoded as bytes before sending.
Finally, the server and client sockets are closed to release the resources and terminate the connection.
Remember to replace 'localhost' with the appropriate IP address if the server is running on a different machine. Similarly, you can add exception handling to handle errors gracefully during the connection and communication process.
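A sketch of what that exception handling might look like on the client side, reusing the same hypothetical host and port from the examples above:

```python
import socket

def send_message(host, port, message, timeout=5.0):
    """Send one message and return the server's reply, or None on failure."""
    try:
        # create_connection() combines socket creation and connect();
        # the timeout applies to connecting, sending, and receiving.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(message.encode())
            return sock.recv(1024).decode()
    except OSError as exc:  # covers ConnectionRefusedError, timeouts, etc.
        print("Connection failed:", exc)
        return None

# With no server listening on this port, the call returns None
# instead of raising an unhandled exception.
print(send_message("localhost", 5000, "Hello, server!", timeout=1.0))
```

Catching OSError is enough here because connection-refused and timeout errors are both subclasses of it.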
Reading dates and times with Python
You can use Python's built-in file operations and the datetime module.
import datetime
import time

# Path to the text file containing the alarm dates and times
file_path = "path/to/alarms.txt"
# List to store the alarm dates and times
alarms = []
# Open the text file and read the alarm dates and times
with open(file_path, "r") as file:
    for line in file:
        alarm = line.strip()
        alarms.append(alarm)
# Process each alarm
for alarm in alarms:
    # Get the current date and time
    current_datetime = datetime.datetime.now()
    alarm_datetime = datetime.datetime.strptime(alarm, "%Y-%m-%d %H:%M:%S")
    # Compute the next alarm date and time
    if alarm_datetime < current_datetime:
        next_alarm = alarm_datetime + datetime.timedelta(days=1)
    else:
        next_alarm = alarm_datetime
    # Compute the interval until the alarm fires (in seconds)
    interval = (next_alarm - current_datetime).total_seconds()
    # Wait for the interval, then trigger the alarm
    time.sleep(interval)
    print("Alarm date and time:", alarm)
    print("Alarm ringing!")
In this sample code, set the file_path variable to the path of the text file that contains the alarm dates and times.
The code opens the file, reads each alarm's date and time line by line, and stores them in the alarms list.
Each alarm's date and time should be stored in the text file in "YYYY-MM-DD HH:MM:SS" format, one alarm per line.
The code processes each alarm, computes the next alarm date and time, waits for the interval using time.sleep, and then triggers the alarm.
The sample uses print statements to display the alarm's date and time and a ringing reminder; adjust these as needed.
Python's built-in database: SQLite
import sqlite3
# Connect to the database
conn = sqlite3.connect('example.db')
# Create a cursor object
cursor = conn.cursor()
# Execute a query
cursor.execute('SELECT SQLITE_VERSION()')
# Print the query result
data = cursor.fetchone()
print("SQLite version:", data)
# SQLite version: ('3.40.1',)
# Create a table
# Create a table named students with three columns: id, name, and age
cursor.execute('''CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)''')
# cursor.execute('''CREATE TABLE stocks
#                (date text, trans text, symbol text, qty real, price real)''')
# Insert data
# Insert one row into the students table
cursor.execute("INSERT INTO students (name, age) VALUES ('张三', 20)")
# cursor.execute("INSERT INTO stocks VALUES ('2022-10-28', 'BUY', 'GOOG', 100, 490.1)")
# Save the changes
conn.commit()
# Query all rows in the students table
cursor.execute("SELECT * FROM students")
rows = cursor.fetchall()
# Print the query results
for row in rows:
    print(row)
# Update the name of the row with id 1 in the students table to '李四'
cursor.execute("UPDATE students SET name=? WHERE id=?", ('李四', 1))
# Query all rows in the students table
cursor.execute("SELECT * FROM students")
rows = cursor.fetchall()
# Print the query results
for row in rows:
    print(row)
# Delete the row with id 1 from the students table
cursor.execute("DELETE FROM students WHERE id=?", (1,))
# Commit the changes
conn.commit()
# Close the connection
conn.close()
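The same kind of workflow can also use parameterized executemany() and the connection as a context manager, which commits on success and rolls back on error. A sketch, using an in-memory database to keep it self-contained:

```python
import sqlite3

# An in-memory database keeps the demo self-contained.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)

# "with connection:" opens a transaction that is committed if the block
# succeeds and rolled back if it raises an exception. Note that it does
# NOT close the connection.
with connection:
    connection.executemany(
        "INSERT INTO students (name, age) VALUES (?, ?)",
        [("Alice", 20), ("Bob", 22)],
    )

rows = connection.execute(
    "SELECT name, age FROM students ORDER BY id"
).fetchall()
print(rows)  # [('Alice', 20), ('Bob', 22)]
connection.close()
```

Using ? placeholders instead of string formatting also protects against SQL injection when the values come from user input.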
To use the sqlite3 module, you must first create a connection object that represents the database; then, optionally, you can create a cursor object, which will help you execute SQL statements.
Python sqlite3 module APIs
The following are important sqlite3 module routines, which should cover most of what you need to work with an SQLite database from your Python program.
If you are looking for something more sophisticated, you can look into the official documentation for Python's sqlite3 module.

1. sqlite3.connect(database [, timeout, other optional arguments])
This API opens a connection to the SQLite database file.
You can use ":memory:" to open a database connection to a database that resides in RAM instead of on disk.
If the database is opened successfully, it returns a connection object.
When a database is accessed by multiple connections and one of the processes modifies the database, the SQLite database is locked until that transaction is committed.
The timeout parameter specifies how long the connection should wait for the lock to go away before raising an exception.
The default for the timeout parameter is 5.0 (five seconds).
If the given database name does not exist, then this call will create the database.
You can specify the filename with a path as well if you want to create the database anywhere other than the current directory.

2. connection.cursor([cursorClass])
This routine creates a cursor, which you will use throughout your database programming with Python.
This method accepts a single optional parameter, cursorClass.
If supplied, this must be a custom cursor class that extends sqlite3.Cursor.

3. cursor.execute(sql [, optional parameters])
This routine executes an SQL statement.
The SQL statement may be parameterized (i.e., use placeholders instead of SQL literals).
The sqlite3 module supports two kinds of placeholders: question marks (qmark style) and named placeholders (named style).
For example: cursor.execute("INSERT INTO people VALUES (?, ?)", (who, age))

4. connection.execute(sql [, optional parameters])
This routine is a shortcut for the execute method provided by the cursor object: it creates an intermediate cursor object by calling the cursor method, then calls that cursor's execute method with the given parameters.

5. cursor.executemany(sql, seq_of_parameters)
This routine executes an SQL command against all parameter sequences or mappings found in seq_of_parameters.

6. connection.executemany(sql[, parameters])
This routine is a shortcut that creates an intermediate cursor object by calling the cursor method, then calls the cursor's executemany method with the given parameters.

7. cursor.executescript(sql_script)
This routine executes multiple SQL statements at once, provided in the form of a script.
It issues a COMMIT statement first, then executes the SQL script it gets as a parameter.
All the SQL statements should be separated by a semicolon (;).

8. connection.executescript(sql_script)
This routine is a shortcut that creates an intermediate cursor object by calling the cursor method, then calls the cursor's executescript method with the given parameter.

9. connection.total_changes
This attribute (note: not a method) gives the total number of database rows that have been modified, inserted, or deleted since the database connection was opened.

10. connection.commit()
This method commits the current transaction.
If you don't call this method, anything you did since the last call to commit() is not visible from other database connections.

11. connection.rollback()
This method rolls back any changes to the database since the last call to commit().

12. connection.close()
This method closes the database connection.
Note that this does not automatically call commit().
If you just close your database connection without calling commit() first, your changes will be lost!

13. cursor.fetchone()
This method fetches the next row of a query result set, returning a single sequence, or None when no more data is available.

14. cursor.fetchmany([size=cursor.arraysize])
This routine fetches the next set of rows of a query result, returning a list.
An empty list is returned when no more rows are available.
The method tries to fetch as many rows as indicated by the size parameter.

15. cursor.fetchall()
This routine fetches all (remaining) rows of a query result, returning a list.
An empty list is returned when no rows are available.
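A short, self-contained demonstration of the three fetch routines, using an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE nums (n INTEGER)")
cur.executemany("INSERT INTO nums VALUES (?)", [(i,) for i in range(5)])

cur.execute("SELECT n FROM nums ORDER BY n")
first = cur.fetchone()   # one row: (0,)
pair = cur.fetchmany(2)  # the next two rows: [(1,), (2,)]
rest = cur.fetchall()    # everything remaining: [(3,), (4,)]
print(first, pair, rest)
conn.close()
```

Each call advances the cursor, so the three methods can be mixed on the same result set.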
Connect To Database
The following Python code shows how to connect to an existing database.
If the database does not exist, it will be created, and finally a connection object will be returned.
#!/usr/bin/python
import sqlite3

conn = sqlite3.connect('test.db')
print("Opened database successfully")
Here, you can also supply the database name as the special name :memory: to create a database in RAM.
Now, let's run the above program to create our database test.db in the current directory.
You can change your path as per your requirement.
Keep the above code in sqlite.py file and execute it as shown below.
If the database is successfully created, then it will display the following message.
$chmod +x sqlite.py
$./sqlite.py
Opened database successfully
Create a Table
The following Python program creates a table in the previously created database.
#!/usr/bin/python
import sqlite3

conn = sqlite3.connect('test.db')
print("Opened database successfully")

conn.execute('''CREATE TABLE COMPANY
       (ID INT PRIMARY KEY NOT NULL,
       NAME TEXT NOT NULL,
       AGE INT NOT NULL,
       ADDRESS CHAR(50),
       SALARY REAL);''')
print("Table created successfully")
conn.close()
When the above program is executed, it will create the COMPANY table in your test.db and it will display the following messages −
Opened database successfully
Table created successfully
INSERT Operation
The following Python program shows how to create records in the COMPANY table created in the above example.
#!/usr/bin/python
import sqlite3

conn = sqlite3.connect('test.db')
print("Opened database successfully")

conn.execute("INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
      VALUES (1, 'Paul', 32, 'California', 20000.00 )")
conn.execute("INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
      VALUES (2, 'Allen', 25, 'Texas', 15000.00 )")
conn.execute("INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
      VALUES (3, 'Teddy', 23, 'Norway', 20000.00 )")
conn.execute("INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
      VALUES (4, 'Mark', 25, 'Rich-Mond ', 65000.00 )")

conn.commit()
print("Records created successfully")
conn.close()
When the above program is executed, it will create the given records in the COMPANY table and it will display the following two lines −
Opened database successfully
Records created successfully
SELECT Operation
The following Python program shows how to fetch and display records from the COMPANY table created in the above example.
#!/usr/bin/python
import sqlite3

conn = sqlite3.connect('test.db')
print("Opened database successfully")

cursor = conn.execute("SELECT id, name, address, salary from COMPANY")
for row in cursor:
    print("ID = ", row[0])
    print("NAME = ", row[1])
    print("ADDRESS = ", row[2])
    print("SALARY = ", row[3], "\n")

print("Operation done successfully")
conn.close()
When the above program is executed, it will produce the following result.
Opened database successfully
ID = 1
NAME = Paul
ADDRESS = California
SALARY = 20000.0
ID = 2
NAME = Allen
ADDRESS = Texas
SALARY = 15000.0
ID = 3
NAME = Teddy
ADDRESS = Norway
SALARY = 20000.0
ID = 4
NAME = Mark
ADDRESS = Rich-Mond
SALARY = 65000.0
Operation done successfully
UPDATE Operation
The following Python code shows how to use the UPDATE statement to update a record, then fetch and display the updated records from the COMPANY table.
#!/usr/bin/python
import sqlite3

conn = sqlite3.connect('test.db')
print("Opened database successfully")

conn.execute("UPDATE COMPANY set SALARY = 25000.00 where ID = 1")
conn.commit()
print("Total number of rows updated :", conn.total_changes)

cursor = conn.execute("SELECT id, name, address, salary from COMPANY")
for row in cursor:
    print("ID = ", row[0])
    print("NAME = ", row[1])
    print("ADDRESS = ", row[2])
    print("SALARY = ", row[3], "\n")

print("Operation done successfully")
conn.close()
When the above program is executed, it will produce the following result.
Opened database successfully
Total number of rows updated : 1
ID = 1
NAME = Paul
ADDRESS = California
SALARY = 25000.0
ID = 2
NAME = Allen
ADDRESS = Texas
SALARY = 15000.0
ID = 3
NAME = Teddy
ADDRESS = Norway
SALARY = 20000.0
ID = 4
NAME = Mark
ADDRESS = Rich-Mond
SALARY = 65000.0
Operation done successfully
DELETE Operation
The following Python code shows how to use the DELETE statement to delete a record, then fetch and display the remaining records from the COMPANY table.
#!/usr/bin/python
import sqlite3

conn = sqlite3.connect('test.db')
print("Opened database successfully")

conn.execute("DELETE from COMPANY where ID = 2;")
conn.commit()
print("Total number of rows deleted :", conn.total_changes)

cursor = conn.execute("SELECT id, name, address, salary from COMPANY")
for row in cursor:
    print("ID = ", row[0])
    print("NAME = ", row[1])
    print("ADDRESS = ", row[2])
    print("SALARY = ", row[3], "\n")

print("Operation done successfully")
conn.close()
When the above program is executed, it will produce the following result.
Opened database successfully
Total number of rows deleted : 1
ID = 1
NAME = Paul
ADDRESS = California
SALARY = 25000.0
ID = 3
NAME = Teddy
ADDRESS = Norway
SALARY = 20000.0
ID = 4
NAME = Mark
ADDRESS = Rich-Mond
SALARY = 65000.0
Operation done successfully
AI can provide valuable assistance in learning programming, specifically Python, in the following ways:
Interactive Learning Platforms: AI-powered platforms can offer interactive lessons and tutorials for learning Python.
These platforms can provide step-by-step instructions, coding challenges, and interactive coding environments where learners can practice writing and executing Python code.
AI algorithms can analyze learners' code and provide immediate feedback, helping them identify and correct errors.
Intelligent Code Autocompletion: AI-based code editors and integrated development environments (IDEs) can offer intelligent code autocompletion suggestions while programming in Python.
These suggestions are based on context, syntax, and common programming patterns.
AI-powered autocompletion can help learners explore different options, reduce syntax errors, and improve coding efficiency.
Error Detection and Debugging: AI can assist in detecting and debugging errors in Python code.
By analyzing code syntax, structure, and runtime behavior, AI algorithms can identify potential errors, offer suggestions for correction, and provide explanations for common mistakes.
This helps learners understand and resolve coding issues more effectively.
Code Generation and Examples: AI can generate Python code snippets or complete functions based on specified requirements or desired outcomes.
This can be particularly helpful for beginners who are learning the language and need assistance with writing correct and functional code.
AI can also provide real-life examples of Python code usage in various applications and domains.
Natural Language Processing (NLP): AI-powered NLP capabilities can aid in understanding Python documentation, tutorials, and forums.
NLP algorithms can analyze and interpret text-based resources, extract relevant information, and provide explanations in a more accessible and understandable format.
This can assist learners in comprehending complex programming concepts and syntax.
Intelligent Recommendations: AI algorithms can recommend relevant learning resources, tutorials, and projects based on learners' proficiency level, interests, and areas of improvement.
These recommendations can help learners discover additional learning materials, practice Python in different contexts, and explore advanced topics at their own pace.
Collaborative Learning and Coding Communities: AI can facilitate collaborative learning and coding communities by connecting learners with peers, mentors, and experts in Python programming.
AI-powered platforms can match learners with similar interests or skill levels for group projects, coding challenges, and code reviews.
This fosters an environment of peer support, knowledge sharing, and collective learning.
AI-based Python Libraries and Frameworks: AI libraries and frameworks like TensorFlow, PyTorch, and scikit-learn provide powerful tools for developing AI and machine learning applications in Python.
Learning these libraries and frameworks can open up opportunities to explore and apply AI techniques within Python programming.
It's important to note that while AI can assist in learning Python, hands-on practice, active problem-solving, and engagement with programming exercises and projects remain crucial for developing programming skills.
AI serves as a supportive tool to enhance the learning experience, but it should not replace practical coding experience and conceptual understanding.
Here are a few interactive learning platforms that can help you learn Python programming:
Codecademy (www.codecademy.com): Codecademy offers interactive Python courses that guide learners through coding exercises, projects, and quizzes.
The platform provides a hands-on learning experience and covers topics ranging from Python basics to advanced concepts.
Coursera (www.coursera.org): Coursera hosts a variety of Python programming courses offered by universities and institutions worldwide.
These courses often include interactive coding exercises, video lectures, and assignments to reinforce learning.
DataCamp (www.datacamp.com): DataCamp specializes in data science and offers interactive Python courses focused on data analysis, visualization, and machine learning.
The platform provides a learn-by-doing approach with coding exercises and real-world projects.
edX (www.edx.org): edX offers Python courses from renowned universities and institutions.
These courses cover Python fundamentals, web development, data science, and more.
The platform provides interactive coding exercises and assessments to test your knowledge.
SoloLearn (www.sololearn.com): SoloLearn offers a mobile app and web platform with interactive Python courses.
The courses are designed in a gamified format, allowing learners to earn points, compete with peers, and practice coding challenges.
Codewars (www.codewars.com): Codewars provides a platform for users to solve coding challenges in various programming languages, including Python.
You can choose Python-specific challenges of different difficulty levels and learn from community solutions.
JetBrains Academy (www.jetbrains.com/academy): JetBrains Academy offers an interactive learning platform with Python courses and projects.
The platform provides an integrated development environment (IDE) and offers step-by-step guidance for learning Python and building real-world applications.
Remember, while these platforms provide interactive learning experiences, it's important to practice coding regularly, work on projects, and engage in problem-solving to solidify your Python programming skills.
create new project process
create dir
cd dir
git init
git remote add origin git@github.com:$USERNAME/$1.git
touch README.md
git add .
git commit -m "Initial commit"
git push -u origin master
code .
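The steps above can be wrapped in a small shell function; the USERNAME in the remote URL is a placeholder you would replace, and the push is left commented out until the remote repository actually exists:

```shell
#!/bin/sh
# new_project NAME: create a directory, initialize git, and make the
# first commit. Replace USERNAME with your GitHub username.
new_project() {
    name="$1"
    mkdir "$name" && cd "$name" || return 1
    git init
    git remote add origin "git@github.com:USERNAME/$name.git"
    touch README.md
    git add .
    git commit -m "Initial commit"
    # git push -u origin master
    # code .
}
```

Call it as `new_project myapp` to scaffold a new repository in one step.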
pip install pygubu-designer
Pygubu is a RAD tool to enable quick and easy development of user interfaces for the Python's tkinter module.
The user interfaces designed are saved as XML files, and, by using the pygubu builder, these can be loaded by applications dynamically as needed.
https://github.com/alejandroautalan/pygubu-designer
Usage
Type the following command in the terminal:
C:\Python3\Scripts\pygubu-designer.exe
Where C:\Python3 is the path to your Python installation directory.
If you see an error like:
[WinError 2] No such file or directory
The system cannot find the file specified: 'C:\\Python311\\Scripts\\chardetect.exe' -> 'C:\\Python311\\Scripts\\chardetect.exe.deleteme'
Try running the command as administrator, or use:
pip install numpy --user
to install numpy without any special privileges.
Pytube is a lightweight library written in Python.
Pytube provides a command-line feature that allows you to stream videos directly from the terminal easily.
To install pytube, use the command matching your Python version.
For Python 2: pip install pytube
For Python 3: pip3 install pytube
For pytube3: pip install pytube3
To save the audio file, we use the os module, which is part of Python's standard library, so nothing extra needs to be installed.
Procedure:
First, we need to import the required (pytube and os) module.
Implementation: Python3
# importing packages
from pytube import YouTube
import os
# url input from user
yt = YouTube(
str(input("Enter the URL of the video you want to download: \n>> ")))
# extract only audio
video = yt.streams.filter(only_audio=True).first()
# check for destination to save file
print("Enter the destination (leave blank for current directory)")
destination = str(input(">> ")) or '.'
# download the file
out_file = video.download(output_path=destination)
# save the file
base, ext = os.path.splitext(out_file)
new_file = base + '.mp3'
os.rename(out_file, new_file)
# result of success
print(yt.title + " has been successfully downloaded.")
Cut MP3 file
Before we go forward, we need to install FFmpeg on your system, as it is required for working with mp3 files. To download it, you can visit this site: https://phoenixnap.com/kb/ffmpeg-windows.
Also, we will use pydub library to perform this task.
pip install pydub
Step 1: Open an mp3 file using pydub.
from pydub import AudioSegment
song = AudioSegment.from_mp3("test.mp3")
Step 2: Slice audio
# pydub does things in milliseconds
ten_seconds = 10 * 1000
first_10_seconds = song[:ten_seconds]
last_5_seconds = song[-5000:]
Step 3: Save the results as a new file in mp3 audio format.
first_10_seconds.export("new.mp3", format="mp3")
Example:
from pydub import AudioSegment
# Open an mp3 file
song = AudioSegment.from_file("testing.mp3",
format="mp3")
# pydub does things in milliseconds
ten_seconds = 10 * 1000
# song clip of 10 seconds from starting
first_10_seconds = song[:ten_seconds]
# save file
first_10_seconds.export("first_10_seconds.mp3",
format="mp3")
print("New Audio file is created and saved")
print("Content that you wanna print on screen")
var1 = "Shruti"
print("Hi my name is: ",var1)
Taking Input From the User
var1 = input("Enter your name: ")
print("My name is: ", var1)
To take input as an integer:
var1 = int(input("Enter an integer value: "))
print(var1)
To take input as a float:
var1 = float(input("Enter a float value: "))
print(var1)
range Function
The range function returns a sequence of numbers, e.g., the numbers from 0 to n-1 for range(0, n).
range(int_start_value, int_stop_value, int_step_value)
Here the start value defaults to 0 and the step value defaults to 1 if not supplied; int_stop_value is the only compulsory parameter.
example
Display all even numbers between 1 and 100:
for i in range(2, 101, 2):
    print(i)
Comments
Comments are used to make the code more understandable for programmers, and they are not executed by the compiler or interpreter.
Single line comment
# This is a single line comment
Multi-line comment
'''This is a
multi-line
comment'''
Escape Sequence
An escape sequence is a sequence of characters that doesn't represent itself when used inside a string literal; instead, it is translated into another character. Some of the escape sequence characters are as follows:
Newline
Newline Character
print("\n")
Backslash
It adds a backslash
print("\\")
Single Quote
It adds a single quotation mark
print("\'")
Tab
It gives a tab space
print("\t")
Backspace
It adds a backspace
print("\b")
Octal value
It represents a character by its octal value, where ooo stands for up to three octal digits
print("\101")
Hex value
It represents a character by its hexadecimal value, where hh stands for two hex digits
print("\x41")
Carriage Return
Carriage return, or \r, moves the cursor to the beginning of the current line.
print("\r")
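A few of these sequences in action:

```python
# Each escape sequence is translated into another character when the
# string is created.
print("Line one\nLine two")   # \n starts a new line
print("Col1\tCol2")           # \t inserts a tab
print("A backslash: \\")      # \\ produces a single backslash
print("It\'s fine")           # \' produces a single quote
print("\x41\x42")             # hex escapes: prints AB
print("\101\102")             # octal escapes: also prints AB
```

Note that "\\" is a string of length 1: the two characters in the source become one backslash in the value.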
Strings
Python string is a sequence of characters, and each character can be individually accessed using its index.
String
You can create Strings by enclosing text in both forms of quotes - single quotes or double quotes.
variable_name = "String Data"
example
s = "Shruti"  # avoid naming a variable str, which would shadow the built-in
print("string is", s)
Indexing
Every character in the string has a position (index), starting at 0 for the first character and ending at length-1 for the last.
Slicing
Slicing refers to obtaining a sub-string from the given string. The following code will include index 1, 2, 3, and 4 for the variable named var_name
Slicing of the string can be obtained by the following syntax
string_var[int_start_value:int_stop_value:int_step_value]
var_name[1 : 5]
Here the start and step values are taken as 0 and 1 respectively if not mentioned by the programmer
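The indexing and slicing rules above can be sketched on a sample string (the value is arbitrary):

```python
word = "PythonRocks"
first = word[0]        # 'P' (index 0)
last = word[-1]        # 's' (negative indexes count from the end)
sub = word[1:5]        # 'ytho' (indexes 1, 2, 3 and 4)
every_2nd = word[::2]  # every second character
rev = word[::-1]       # a negative step walks the string backwards
print(first, last, sub, every_2nd, rev)
```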
isalnum() method
Returns True if all the characters in the string are alphanumeric, else False
string_variable.isalnum()
isalpha() method
Returns True if all the characters in the string are alphabets
string_variable.isalpha()
isdecimal() method
Returns True if all the characters in the string are decimals
string_variable.isdecimal()
isdigit() method
Returns True if all the characters in the string are digits
string_variable.isdigit()
islower() method
Returns True if all characters in the string are lower case
string_variable.islower()
isspace() method
Returns True if all characters in the string are whitespaces
string_variable.isspace()
isupper() method
Returns True if all characters in the string are upper case
string_variable.isupper()
lower() method
Converts a string into lower case equivalent
string_variable.lower()
upper() method
Converts a string into upper case equivalent
string_variable.upper()
strip() method
It removes leading and trailing spaces in the string
string_variable.strip()
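A quick tour of the string methods above on sample values:

```python
# Each call below exercises one of the methods listed above.
print("abc123".isalnum())   # True: letters and digits only
print("abc".isalpha())      # True: letters only
print("123".isdigit())      # True: digits only
print("   ".isspace())      # True: whitespace only
print("hello".islower())    # True
print("HELLO".isupper())    # True
print("Hello".lower())      # hello
print("Hello".upper())      # HELLO
print("  hi  ".strip())     # hi (leading/trailing spaces removed)
```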
List
A List in Python represents a list of comma-separated values of any data type between square brackets.
var_name = [element1, element2, ...]
These elements can be of different datatypes
Indexing
Every element in the list has a position, starting at 0 for the first element and ending at length-1 for the last.
A list is an ordered, indexed, mutable, and very flexible and dynamic collection of elements in Python.
Empty List
This syntax creates an empty list
my_list = []
index method
Returns the index of the first element with the specified value
list.index(element)
append method
Adds an element at the end of the list
list.append(element)
extend method
Add the elements of a given list (or any iterable) to the end of the current list
list.extend(iterable)
insert method
Adds an element at the specified position
list.insert(position, element)
pop method
Removes the element at the specified position and returns it
list.pop(position)
remove method
The remove() method removes the first occurrence of a given item from the list
list.remove(element)
clear method
Removes all the elements from the list
list.clear()
count method
Returns the number of elements with the specified value
list.count(value)
reverse method
Reverses the order of the list
list.reverse()
sort method
Sorts the list
list.sort(reverse=True|False)
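The list methods above can be exercised in one short sketch (the sample values are arbitrary):

```python
# Exercising the list methods above on a sample list.
nums = [3, 1, 4, 1, 5]
nums.append(9)          # [3, 1, 4, 1, 5, 9]
nums.extend([2, 6])     # [3, 1, 4, 1, 5, 9, 2, 6]
nums.insert(0, 8)       # [8, 3, 1, 4, 1, 5, 9, 2, 6]
idx = nums.index(4)     # 3: position of the first 4
ones = nums.count(1)    # 2: the value 1 occurs twice
popped = nums.pop()     # removes and returns the last element, 6
nums.remove(1)          # removes the first occurrence of 1
nums.sort()             # ascending order
nums.reverse()          # now descending
print(nums, idx, ones, popped)
```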
Tuples
Tuples are represented as comma-separated values of any data type within parentheses.
Tuple Creation
variable_name = (element1, element2, ...)
These elements can be of different datatypes
Indexing
Every element in the tuple has a position, starting at 0 for the first element and ending at length-1 for the last.
Tuples are an ordered, indexed, immutable collection of elements, which makes them a more secure collection.
Let's talk about some of the tuple methods:
count method
It returns the number of times a specified value occurs in a tuple
tuple.count(value)
index method
It searches the tuple for a specified value and returns the position.
tuple.index(value)
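Both tuple methods in one sketch (the sample values are arbitrary):

```python
# count() and index() on a sample tuple.
point = (1, 2, 2, 3)
twos = point.count(2)   # 2: the value 2 occurs twice
pos = point.index(3)    # 3: first position of the value 3
print(twos, pos)
# point[0] = 5 would raise a TypeError, since tuples are immutable
```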
Sets
A set is a collection of multiple values which is both unordered and unindexed. It is written in curly brackets.
Set Creation: Way 1
var_name = {element1, element2, ...}
Set Creation: Way 2
var_name = set([element1, element2, ...])
A set is an unordered, non-indexed type of collection. The set itself is mutable, but its elements must be immutable, and duplicate elements are not allowed.
Set Methods
Let's talk about some of the methods of sets:
add() method
Adds an element to a set
set.add(element)
clear() method
Remove all elements from a set
set.clear()
discard() method
Removes the specified item from the set
set.discard(value)
intersection() method
Returns intersection of two or more sets
set.intersection(set1, set2 ... etc)
issubset() method
Checks if a set is a subset of another set
set.issubset(set)
pop() method
Removes and returns an arbitrary element from the set
set.pop()
remove() method
Removes the specified element from the set
set.remove(item)
union() method
Returns the union of two or more sets
set.union(set1, set2...)
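The set methods above can be exercised together (the sample values are arbitrary):

```python
# Exercising the set methods above on sample sets.
a = {1, 2, 3}
b = {3, 4, 5}
a.add(6)                    # a is now {1, 2, 3, 6}
a.discard(10)               # no error even though 10 is absent
u = a.union(b)              # every element from both sets
i = a.intersection(b)       # elements common to both sets
sub = {1, 2}.issubset(a)    # True: every element of {1, 2} is in a
print(u, i, sub)
```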
Dictionaries
The dictionary is a collection of comma-separated key:value pairs, within {}, with the requirement that within a dictionary, no two keys can be the same.
Dictionary
<dictionary-name> = {<key>: value, <key>: value ...}
A dictionary is a mutable collection of elements and, since Python 3.7, preserves insertion order. A dictionary allows duplicate values but not duplicate keys.
Empty Dictionary
By putting two curly braces, you can create a blank dictionary
mydict={}
Adding Element to a dictionary
By this method, one can add new elements to the dictionary
<dictionary>[<key>] = <value>
Updating Element in a dictionary
If a specified key already exists, then its value will get updated
<dictionary>[<key>] = <value>
Deleting an element from a dictionary
del keyword is used to delete a specified key:value pair from the dictionary as follows:
del <dictionary>[<key>]
Dictionary Functions & Methods
Below are some of the methods of dictionaries
len() method
It returns the length of the dictionary, i.e., the count of elements (key: value pairs) in the dictionary
len(dictionary)
clear() method
Removes all the elements from the dictionary
dictionary.clear()
get() method
Returns the value of the specified key
dictionary.get(keyname)
items() method
Returns a view object containing a (key, value) tuple for each pair
dictionary.items()
keys() method
Returns a view object containing the dictionary's keys
dictionary.keys()
values() method
Returns a view object of all the values in the dictionary
dictionary.values()
update() method
Updates the dictionary with the specified key-value pairs
dictionary.update(iterable)
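The dictionary operations above can be exercised together (the keys and values are arbitrary):

```python
# Exercising the dictionary operations above.
person = {"name": "Harry", "age": 25}
size = len(person)                # 2 key:value pairs
n = person.get("name")            # 'Harry'
city = person.get("city", "?")    # '?' default, since 'city' is missing
person["age"] = 26                # update an existing key
person.update({"city": "Delhi"})  # add a new key:value pair
del person["age"]                 # delete a key:value pair
print(size, n, city, list(person.keys()), list(person.values()))
```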
Indentation
In Python, indentation means the code is written with leading spaces or tabs that group statements into blocks, so that the interpreter knows which statements belong together.
Indentation is applied to conditional statements and loop control statements; the indent specifies the block of code that is to be executed depending on the condition.
Conditional Statements
The if, elif and else statements are the conditional statements in Python, and these implement selection constructs (decision constructs).
if (conditional expression):
    statements
elif (conditional expression):
    statements
else:
    statements
Nested if-else Statement
if (conditional expression):
    if (conditional expression):
        statements
    else:
        statements
else:
    statements
example
a = 15
b = 20
c = 12
if (a > b and a > c):
    print(a, "is greatest")
elif (b > c and b > a):
    print(b, "is greatest")
else:
    print(c, "is greatest")
Loops in Python
A loop or iteration statement repeatedly executes a statement, known as the loop body, until the controlling expression is false (0).
For Loop
The for loop of Python is designed to process the items of any sequence, such as a list or a string, one by one.
for <variable> in <sequence>:
    statements_to_repeat
example
for i in range(1, 101, 1):
    print(i)
While Loop
A while loop is a conditional loop that repeats the instructions within itself as long as its condition remains true.
while <logical-expression>:
    loop-body
example
i = 1
while (i <= 100):
    print(i)
    i = i + 1
Break Statement
The break statement enables a program to skip over a part of the code. A break statement terminates the very loop it lies within.
for <var> in <sequence>:
    statement1
    if <condition>:
        break
    statement2
statement_after_loop
example
for i in range(1, 101, 1):
    print(i, end=" ")
    if (i == 50):
        break
    else:
        print("Mississippi")
print("Thank you")
Continue Statement
The continue statement skips the rest of the loop statements and causes the next iteration to occur.
for <var> in <sequence>:
    statement1
    if <condition>:
        continue
    statement2
    statement3
statement4
example
for i in [2, 3, 4, 6, 8, 0]:
    if (i % 2 != 0):
        continue
    print(i)
Functions
A function is a block of code that performs a specific task. You can pass parameters into a function. It helps us to make our code more organized and manageable.
Function Definition
def my_function():
    pass  # statements
The def keyword is used to define a function.
Function Call
my_function()
Whenever we need that block of code in our program, we simply call the function by name. If parameters were included when defining the function, we have to pass arguments while calling it.
example
def add():  # function definition
    a = 10
    b = 20
    print(a + b)

add()  # function call
Return statement in Python function
The return statement returns the specified value or data item to the caller.
return [value/expression]
Arguments in python function
Arguments are the values passed inside the parentheses of the function, both while defining it and while calling it.
def my_function(arg1, arg2, arg3, ..., argn):
    # statements

my_function(arg1, arg2, arg3, ..., argn)
example
def add(a, b):
    return a + b

x = add(7, 8)
print(x)
File Handling
File handling refers to reading or writing data from files. Python provides some functions that allow us to manipulate data in the files.
open() function
var_name = open("file name", "mode")
modes-
r - to read the content from file
w - to write the content into file
a - to append new content at the end of the file
r+: To read and write data in the file. The file is not truncated; writing overwrites existing bytes at the current position.
w+: To write and read data. It will override existing data.
a+: To append and read data from the file. It won’t override existing data.
close() function
var_name.close()
read() function
The read functions contain different methods: read(), readline() and readlines()
read()       # returns the whole content as one big string
readline()   # returns one line at a time
readlines()  # returns a list of lines
write function
These functions write string data to the file.
write()       # used to write a fixed sequence of characters to a file
writelines()  # used to write a list of strings
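A minimal round trip through these functions; the filename demo.txt is an assumption and will be created in the current directory:

```python
# Write, then read back, a small file (the filename is an assumption).
f = open("demo.txt", "w")                        # 'w' creates/truncates the file
f.write("first line\n")
f.writelines(["second line\n", "third line\n"])  # a list of strings
f.close()

f = open("demo.txt", "r")
content = f.read()       # the whole file as one big string
f.close()

f = open("demo.txt", "r")
line1 = f.readline()     # just the first line
rest = f.readlines()     # the remaining lines as a list
f.close()
print(content, line1, rest)
```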
Exception Handling
An exception is an unusual condition that results in an interruption in the flow of a program.
try and except
A basic try-except block in Python. When the try block raises an error, control goes to the except block.
try:
    [Statement body block]
    raise Exception()
except Exceptionname:
    [Error processing block]
else
The else block is executed if the try block has not raised any exception and the code ran successfully.
try:
    # statements
except:
    # statements
else:
    # statements
finally
The finally block is executed whether the try block runs successfully or the except block is executed; the finally block always runs.
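All four blocks can be seen together in one sketch; division is the assumed task and safe_div is an illustrative name:

```python
# try/except/else/finally working together.
def safe_div(a, b):
    try:
        result = a / b
    except ZeroDivisionError:
        print("cannot divide by zero")
        return None
    else:
        print("division succeeded")   # runs only if no exception was raised
        return result
    finally:
        print("finally always runs")  # runs in both cases, even after return

ok = safe_div(10, 2)   # takes the else path
bad = safe_div(10, 0)  # takes the except path
print(ok, bad)
```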
Object Oriented Programming (OOPS)
It is a programming approach that primarily focuses on using objects and classes. The objects can be any real-world entities.
class
The syntax for writing a class in python
class class_name:
    pass  # statements
Creating an object
Instantiating an object can be done as follows:
<object-name> = <class-name>(<arguments>)
self parameter
The self parameter is the first parameter of any function defined inside the class. It can have a different name, but this parameter is a must while defining any method of a class, as it is used to access the other data members of the class.
class with a constructor
The constructor is a special function of the class used to initialize objects. The syntax for writing a class with a constructor in Python:
class CodeWithHarry:
    # Default constructor
    def __init__(self):
        self.name = "CodeWithHarry"

    # A method for printing data members
    def print_me(self):
        print(self.name)
Inheritance in python
By using inheritance, we can create a class which uses all the properties and behavior of another class. The new class is known as a derived class or child class, and the one whose properties are acquired is known as a base class or parent class.
It provides the re-usability of the code.
class Base_class:
    pass

class derived_class(Base_class):
    pass
Types of inheritance-
Single inheritance
Multiple inheritance
Multilevel inheritance
Hierarchical inheritance
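A single-inheritance sketch; the class names Animal and Dog are illustrative assumptions:

```python
# Single inheritance: Dog acquires Animal's properties and behavior.
class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        return self.name + " makes a sound"

class Dog(Animal):        # derived class (child) of Animal
    def speak(self):      # the child can override inherited behavior
        return self.name + " barks"

d = Dog("Rex")            # __init__ is reused from Animal
print(d.speak())
print(isinstance(d, Animal))
```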
filter function
The filter function allows you to process an iterable and extract those items that satisfy a given condition
filter(function, iterable)
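For example, a sketch that keeps only the even numbers from a sample list:

```python
# filter() keeps the items for which the function returns True.
nums = [1, 2, 3, 4, 5, 6]
evens = list(filter(lambda n: n % 2 == 0, nums))
print(evens)
```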
issubclass function
Used to find whether a class is a subclass of a given class or not, as follows
issubclass(cls, classinfo)  # returns True if cls is a subclass of classinfo
Iterators and Generators
Here are some of the advanced topics of the Python programming language, like iterators and generators.
Iterator
Used to create an iterator over an iterable
iter_list = iter(['Harry', 'Aakash', 'Rohan'])
print(next(iter_list))
print(next(iter_list))
print(next(iter_list))
Generator
Used to generate values on the fly
# A simple generator function
def my_gen():
    n = 1
    print('This is printed first')
    # Generator function contains yield statements
    yield n

    n += 1
    print('This is printed second')
    yield n

    n += 1
    print('This is printed at last')
    yield n
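Iterating over the generator drives it to each yield in turn; a simplified copy of my_gen (without the prints) keeps this sketch self-contained:

```python
def my_gen():
    n = 1
    yield n       # execution pauses here until the next value is requested
    n += 1
    yield n
    n += 1
    yield n

values = [v for v in my_gen()]  # the for loop calls next() for us
print(values)
```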
Decorators
Decorators are used to modify the behavior of a function or a class. They are usually placed before the definition of the function you want to decorate.
property Decorator (getter)
@property
def name(self):
    return self.__name
setter Decorator
It is used to set the property 'name'
@name.setter
def name(self, value):
    self.__name = value
deleter Decorator
It is used to delete the property 'name'
@name.deleter  # property-name.deleter decorator
def name(self):  # the deleter takes only self, no value
    print('Deleting..')
    del self.__name
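The three property decorators come together in one class; the class and attribute names are illustrative assumptions:

```python
# Getter, setter and deleter for a single managed attribute.
class Person:
    def __init__(self, name):
        self.__name = name

    @property
    def name(self):          # getter
        return self.__name

    @name.setter
    def name(self, value):   # setter
        self.__name = value

    @name.deleter
    def name(self):          # deleter takes only self
        print('Deleting..')
        del self.__name

p = Person("Harry")
before = p.name      # calls the getter
p.name = "Rohan"     # calls the setter
after = p.name
print(before, after)
del p.name           # calls the deleter, removing the attribute
```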