Data Focused Python
Lecture 1
Variable Names
A legal variable name consists of
- A letter or an underscore (_)
- Followed by 0 or more letters and/or decimal digits (0-9) and/or underscores
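The rule above can be checked with the built-in str.isidentifier(). This is a minimal sketch: the helper is_legal_name and the sample names are illustrative assumptions, not part of the lecture. The keyword test is an extra, since a reserved word such as class has a legal form but cannot be used as a variable name. Python 3 also accepts many non-ASCII letters, which the rule above does not mention.

```python
import keyword


def is_legal_name(name):
    """Return True if name could be used as a variable name (hypothetical helper)."""
    # isidentifier() checks the letter/underscore/digit form;
    # iskeyword() rejects reserved words such as 'class'.
    return name.isidentifier() and not keyword.iskeyword(name)


candidates = ["total", "_tmp1", "x2", "2fast", "my-var", "class"]
results = {name: is_legal_name(name) for name in candidates}
for name, ok in results.items():
    print(name, ok)
```

Names that start with a digit or contain a hyphen fail the form check; class passes the form check but is rejected as a keyword.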
Low-Level Scalar Types
- int
- float
- str
- bytes
- bool
- None
Arithmetic Operators
** # exponentiation
+ - # unary plus, minus
* / // % # multiply, divide, "floor" divide (round toward -infinity), modulus (remainder)
+ - # binary plus, minus
# use (...) for grouping
"hello"
'hello'
"hello" + 'hello'
"'hello'" + '"hello"'
"0123456789" * 3 # repetition
'''hello'''
"""hello""" # multiple line quotes
True
False
Arithmetic Operator Associativity - all binary arithmetic operators associate left-to-right, except **, which associates right-to-left
2 ** 3 ** 2 = 2 ** (3 ** 2)
9 / 3 * 2 = (9 / 3) * 2
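The precedence and associativity rules above can be confirmed with a few small expressions. This is only an illustrative sketch; the variable names are made up for the example.

```python
# ** associates right-to-left
right_assoc = 2 ** 3 ** 2      # parsed as 2 ** (3 ** 2)
forced_left = (2 ** 3) ** 2    # grouping changes the result

# the other binary arithmetic operators associate left-to-right
left_assoc = 9 / 3 * 2         # parsed as (9 / 3) * 2

# ** binds tighter than unary minus
neg_square = -2 ** 2           # parsed as -(2 ** 2)

# // rounds toward -infinity, while int() truncates toward zero
floor_div = -7 // 2
remainder = -7 % 2
truncated = int(-7 / 2)

print(right_assoc, forced_left, left_assoc, neg_square)
print(floor_div, remainder, truncated)
```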
a = 2
b = 4
a **= b # a = a ** b
a -= 5 # a = a - 5
Collection Built-In Types
list
- Items are indexed from 0 to n-1
- Or in reverse from -1 to -n
m = [7, 2, 3, 0]
Items of a list can be modified:
[7, 2, 3, 0]
m[2] = 5
[7, 2, 5, 0]
"+" and "*" work as with str objects:
m + [4, 7, 1]
[7, 2, 5, 0, 4, 7, 1]
[0] * 5
[0, 0, 0, 0, 0]
A list supports many named operations, including:
append(val) # "like" push_back, always adds at the end
insert(idx, val) # insert ahead of "idx"
remove(val) # first match of val
pop([idx]) # return item at idx, or at n-1 if no idx, and remove that item
count(val)
sort() # ascending by default
reverse()
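A short walk through the list operations named above, applied one after another to the same list. The values are arbitrary and chosen only for illustration.

```python
m = [7, 2, 5, 0]

m.append(4)         # add at the end: [7, 2, 5, 0, 4]
m.insert(1, 9)      # insert ahead of index 1: [7, 9, 2, 5, 0, 4]
m.remove(5)         # remove the first 5: [7, 9, 2, 0, 4]
last = m.pop()      # remove and return the last item
first = m.pop(0)    # remove and return the item at index 0
twos = m.count(2)   # how many items equal 2
m.sort()            # ascending, in place
m.reverse()         # reverse, in place

print(last, first, twos, m)
```

Note that sort() and reverse() change the list in place and return None, so the result is read from m itself.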
Logical and Physical Lines
- A statement must be contained on a single logical line
- A physical line ends with NEWLINE
- It is a syntax error to split a logical line across multiple physical lines
- unless the physical line ends with \
- or the code is within (), [], or {}
x = \
12
x = (
12
)
Multiple Statements on One Physical Line
- Use ; between statements on a physical line
a = 4; b = 12.5; c = 'x'
Length: len
- Use "len" to get the number of items in any collection (including a str)
len([3, 2, 4, 4])
4
len('hello')
5
Sequence Slices
- "list" and "str" are sequence types
- A slice of a sequence is obtained with seq[i:j:k] # from item i, to not including item j, with step size k
- In seq[i:j:k], missing :k implies :1
- missing i implies 0
- missing j implies len(seq)
ages = [3, 12, 5, 33, 68]
ages[0:3:1]
[3, 12, 5]
ages[0:2]
[3, 12]
ages[:4]
[3, 12, 5, 33]
ages[1:]
[12, 5, 33, 68]
a = [0, 1, 2, 3, 4, 5, 6, 7]
b = a[4:] + a[1:4]
b
[4, 5, 6, 7, 1, 2, 3]
'hello'[:3] + 'p!'
'help!'
s1 = 'international'
s1[::2]
'itrainl'
't' + s1[:8:2] + s1[7:11]
'titration'
Object Identity: id
- Almost anything in Python is an object
- An int, a float, a bool, a str, None, a list, a tuple, a set, ...
- A function, a class, an iterator,...
- Each object is uniquely identified by its id
- The id may or may not be the memory address, depending on the Python implementation
- Collections can have equal items without having the same ids
id(n)
x = 7
id(x)
y = 7
id(y) # same id as "x" and "7"
m = [1,2,3]
id(m)
n = [1,2,3]
id(n) # different id from m
n = m
id(n) # now the same
Equality, Relational and Logical Operators
- From high precedence to low precedence
- Notice keywords rather than operator symbols
- Not the same precedence as in Java, C, C++
- Use (...) for grouping
- Unlike C/C++/Java, equality and relational operators may be chained with the expected mathematical meaning, e.g.:
- a > b > c means a > b and b > c
- Evaluation here is left to right
- If a > b is False, then b > c is not evaluated
== != < <= > >= is is not
not
and
or
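The chaining rule above (left-to-right evaluation that stops at the first False link) can be observed by recording which operands are evaluated. This is a sketch; the helper val and the list calls are hypothetical names introduced only for the demonstration.

```python
calls = []


def val(name, x):
    """Record that an operand was evaluated, then return it (hypothetical helper)."""
    calls.append(name)
    return x


# 1 > 5 is False, so the comparison with c is never evaluated
first = val("a", 1) > val("b", 5) > val("c", 0)
first_calls = list(calls)

calls.clear()

# 9 > 5 and 5 > 0 are both True; b is evaluated only once
second = val("a", 9) > val("b", 5) > val("c", 0)
second_calls = list(calls)

print(first, first_calls)
print(second, second_calls)
```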
Identity vs. Value Comparison
- "is" and "is not" compare object ids
- == and != compare object values
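A small sketch of the difference between comparing values and comparing identities, using two lists with equal items. The variable names are illustrative.

```python
m = [1, 2, 3]
n = [1, 2, 3]

equal_values = (m == n)        # same items, so the values are equal
same_object = (m is n)         # but these are two distinct list objects

n = m                          # now n refers to the same object as m
same_after_assign = (n is m)

n.append(4)                    # a change made through n is visible through m
print(equal_values, same_object, same_after_assign, m)
```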
The if Decision, and Indentation
- The Python 3 if decision is of the form
if bool_expr:
    statement1
    ...        # optional statements
blank line     # this concludes the if
- Indentation of statement1 is required
- Additional statements, if any, must be indented by exactly the same amount
- Even indenting the first character in a top-level line is an error
- Line and Indentation Rules: Good or bad?
- Good:
- Forces a common indentation scheme for all programmers
- No curly braces to keep track of
- Not so good:
- Line continuation is clunky
- In long code, hard to see both the beginning of an if and the end of that if on the same screen
Using the IDLE Editor
- Click File / New File to create a new file
- Later, click Run / Run Module to execute the code
The print Function
- In the interactive shell, the value of a typed expression (other than an assignment) is automatically displayed
- In code in a file, you can use the print function to display output
- By default, displays values separated with spaces
print(a, 'Bob', 11/3, True)
hello Bob 3.6666666666666665 True
Keyword Arguments
- Many functions provide keyword arguments
- Of the form name=value
- Must follow all of the positional arguments
a = 'hello'
print(a, 'Bob', 11/3, True, sep=',') # change the separator (keyword argument)
hello,Bob,3.6666666666666665,True
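print accepts other keyword arguments besides sep. The sketch below uses end and file as well, writing into an in-memory buffer so the exact characters produced can be inspected. The buffer name buf is an assumption for the example.

```python
import io

buf = io.StringIO()

# sep goes between values; end replaces the default newline
print('hello', 'Bob', True, sep=',', end='!\n', file=buf)

# an empty separator joins the values directly
print('a', 'b', sep='', file=buf)

captured = buf.getvalue()
print(captured, end='')
```

Keyword arguments may be given in any order, as long as they follow the positional values.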
General if/elif/else Decisions
- A decision must start with one if part
- Optionally followed by zero or more elif parts
- Optionally followed by one else part
if bool_expr:
    stmt
elif bool_expr:
    stmt
else:
    stmt
if a == 'hello':
    print(a, 'is equal to \'hello\'')
elif b < c:
    print(b, 'is less than', c)
else:
    print('None of the above!')
General while Loops
- A while loop is straightforward
while bool_expr:
    stmt
i = 0
while i < 10:
    print(i, end=' ') # a space will be displayed instead of a new line
    i += 1
Iterating with for Loops
- A for loop steps through each item in an iterable, such as a sequence object
for var in iterable:
    stmt
for i in [1, 5, 9, -4, 12]:
    print(i ** 2)
for c in 'hello':
    print(c)
The range Function
- The range function provides a useful iterable
for var in range(N): # 0, 1, 2, ..., N-1
    stmt
for var in range(M, N): # M, M+1, M+2, ..., N-1
    stmt
for var in range(M, N, S): # M, M+S, M+2S, ..., <N
    stmt
for i in range(5):
    print(i) # 0, 1, 2, 3, 4
for i in range(4, -1, -1):
    print(i) # 4, 3, 2, 1, 0
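Because range is an iterable, wrapping it in list() is a convenient way to see exactly which values each form produces. The variable names below are illustrative.

```python
first_five = list(range(5))          # start defaults to 0
middle = list(range(2, 6))           # stop value is not included
stepped = list(range(10, 0, -3))     # a negative step counts down
countdown = list(range(4, -1, -1))   # stop at -1 so that 0 is included
empty = list(range(5, 2))            # start >= stop with a positive step

print(first_five, middle, stepped, countdown, empty)
```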
Modules
- Python uses modules of code for extended capabilities
- Modules, or module items, can be imported into your code
- For example, the math module contains many common mathematical functions and values
The math Module
import math
math.exp(1)
import math as m
m.exp(1)
m.cos(0)
m.sqrt(2)
m.log(1234)
from math import exp
exp(1) # if you just need exp()
from math import * # very risky! everything
exp(sqrt(sin(pi))) # all math available???
pi
sin(pi)
sqrt(sin(pi))
exp(sqrt(sin(pi)))
Copying a Text File
- Quick and dirty: we will do better than this
fin = open('/dir1/dir2/in.txt', 'rt', encoding='utf-8') # input file, read text
fout = open('/dir1/dir2/out.txt', 'wt', encoding='utf-8') # output file, created or overwritten
for line in fin: # line is a str
    fout.write(line) # write the line into the output file
fin.close() # less important to close input file
fout.close() # must close output file
# \ means another line
# / directory name separator
# 'rt' - read text
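One common way to improve on the quick-and-dirty copy is the with statement, which closes both files automatically, even if an error occurs part-way through. This is a sketch, not necessarily the approach the course takes later. The helper copy_text_file and the temporary file names are assumptions made so the example can run anywhere.

```python
import os
import tempfile


def copy_text_file(src_path, dst_path):
    """Copy a UTF-8 text file line by line (hypothetical helper)."""
    with open(src_path, 'rt', encoding='utf-8') as fin, \
         open(dst_path, 'wt', encoding='utf-8') as fout:
        for line in fin:          # line is a str, including its newline
            fout.write(line)
    # both files are closed here, without explicit close() calls


# try it on a small file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.txt')
    dst = os.path.join(tmp, 'out.txt')
    with open(src, 'wt', encoding='utf-8') as f:
        f.write('first line\nsecond line\n')
    copy_text_file(src, dst)
    with open(dst, 'rt', encoding='utf-8') as f:
        copied = f.read()

print(copied, end='')
```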
Lecture 2: More Collections, Type Conversion, and Web Scraping
Collection Built-In Types
- Python 3.7 provides
list # "like" an array
tuple # "like" a record
set # just one of each value
frozenset # (we won't care)
- A str can also be used as a collection
tuple Example
- A tuple is enclosed in (...)
- Items are indexed from 0 to n-1
- Or in reverse from -1 to -n
- Use [idx] to access an item
n = ('a', 'golden', 'eagle', 'soars')
type(n)
n
n[2]
n = (n[0], 'bald', n[2], n[3])
n
- To assign tuple items to separate variables
- Called sequence unpacking
a,b,c,d = n
c
- You can construct a tuple without parens
- Called tuple packing
t = 1, 4, 'hi', 4.6
t
Multiple Assignment
- Multiple assignment combines tuple packing and sequence unpacking
a, b, c, d = 'hi', 12.6, True, 9
- To swap the value of two variables
a = -7
b = 3
b, a = a, b
print (a, b)
Empty and One-Item tuples
- An empty tuple can be represented as ()
t = ()
type(t)
t
- But parens are also used for grouping
- (value) is simply value
t = (6)
t
- The goofy (value,) creates a one-item tuple
t = (6,) # otherwise it's an int
t
len(t)
tuple Slices and Concatenation
- Like a list or str, a tuple is a sequence, supporting slices
- Another way to construct a tuple from an existing tuple:
n
n = n[:1] + ('harpy',) + n[2:] # n[:1] is everything up to but not including index 1; n[2:] is everything from index 2 on
n
set Example
- A set is enclosed in {...}
- Items are unsorted (although may appear sorted when displayed), with no duplicates
s = {1, 6, 5, 9, 2, 1, 6, 3}
type(s)
s
set Named Operations
- set provides many named operations for manipulating items, including
add(val) # add val to set
discard(val) # remove val from set, if val is a member
remove(val) # remove val, and fail if val is not a member
pop() # remove and return an arbitrary item
clear() # remove all items
val in s # True if val is in s, otherwise False
s
s.add(7)
s.add(3) # 3 is already in set, therefore is not added
s.discard(7) # 7 is gone
s.discard(13) # 13 is not there
s.remove(2) # 2 is removed
s.remove(13) # fails as there's no 13
s.pop()
s.add(1)
s.add(3)
s.add(13)
s.add(6) # 6 is already there so get ignored
set vs set Operations
- set provides "the usual suspects" of set vs. set operations (methods)
s1.difference(s2) # s1 - s2
s1.symmetric_difference(s2) # union of (s1 - s2) and (s2 - s1)
s1.isdisjoint(s2) # True if s1 and s2 have no items in common
s1.issubset(s2) # s1 is subset of s2, True or False
s1.issuperset(s2) # s2 is subset of s1, True or False
s1.union(s2) # s1 and s2 union
s1 - s2 # items of s1, with the intersection removed
set Symbolic Operations
- set provides some symbolic (borrowed from C/C++) operations that can be used rather than named operations
- # set difference
& # "and" (intersection)
^ # "xor" (symmetric difference)
| # "or" (union)
# from highest to lowest precedence
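The symbolic operators give the same results as the corresponding named methods, and their precedence matters when they are mixed without parentheses. A brief sketch with two made-up sets:

```python
s1 = {1, 2, 3, 4}
s2 = {3, 4, 5}

diff = s1 - s2     # items in s1 but not in s2
inter = s1 & s2    # items in both
sym = s1 ^ s2      # items in exactly one of the two
uni = s1 | s2      # items in either

# each symbolic form agrees with its named method
named_match = (
    diff == s1.difference(s2)
    and inter == s1.intersection(s2)
    and sym == s1.symmetric_difference(s2)
    and uni == s1.union(s2)
)

# & binds tighter than |, so grouping changes the result
mixed = s1 | s2 & {5}        # parsed as s1 | (s2 & {5})
grouped = (s1 | s2) & {5}

print(diff, inter, sym, uni, named_match)
print(mixed, grouped)
```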
Remember: All Generic, All The Time
- A set can contain any hashable items
- Scalar: int, float, bool, str, None
- tuples if all items are hashable
- Not lists or sets, since these are mutable
s3 = {5, 4.7, None, 'hello', True, (2,9,14)}
s3
The dict Collection Type
- Python 3.7 also provides
dict # dictionary
- A dict consists of key: value pairs, enclosed in {...}
- Keys must be hashable
- No restrictions on values
n2e = {'john': 'jkstlund@gmail.com',
'al': 'al@alcorp.net',
'bob': 'bob@bassoc.com'}
type(n2e)
n2e['cy'] = 'cy@nou.edu' # add one more item
# store in the order of key creation
n2e['john'] = 'jkstlund@andrew.cmu.edu' # the value is changed
dict Operations
- dict provides many named operations for manipulating items, including
get(key[,def]) # return value for key, or def if key does not exist, or None if def is not provided
popitem() # remove and return some (key, value) pair as a tuple
pop(key[, def]) # return value for key, and remove (key, value) from dict; if key is not found, return def, or fail if def is not provided
clear() # remove all items
dict Iterables
- dict provides three iterables that make it easy to loop through keys, values, or items
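The dict operations listed above behave as follows on a small dictionary. The keys and addresses are taken from the n2e example; the other variable names are illustrative. In Python 3.7 and later, popitem() removes the most recently inserted pair.

```python
n2e = {'al': 'al@alcorp.net', 'bob': 'bob@bassoc.com', 'cy': 'cy@nou.edu'}

missing = n2e.get('zed')               # no such key, no default: None
fallback = n2e.get('zed', 'unknown')   # no such key: the default is returned

removed = n2e.pop('al')                # returns the value and removes the pair
safe = n2e.pop('zed', None)            # a default avoids the KeyError

try:
    n2e.pop('zed')                     # no default given, so this fails
    raised = False
except KeyError:
    raised = True

last_pair = n2e.popitem()              # removes and returns a (key, value) tuple
remaining = dict(n2e)                  # snapshot before clearing
n2e.clear()

print(missing, fallback, removed, safe, raised, last_pair, remaining, n2e)
```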
keys() # iterable over all dict keys
values() # iterable over all dict values
items() # iterable over all dict (key,values) tuples
n2e.keys()
n2e.values()
n2e.items()
for k in n2e.keys():
    print(k)
n2e['dave'] = 'dave@dave.org'
n2e
for j in n2e.keys():
    print(j)
for i in n2e.items():
    print(i)
Empty and One-Item sets and dicts
- Both set and dict have items enclosed in {}
de = {} # empty dict
type(de)
de
se = set() # empty set
type(se)
se
d1 = {key: value} # one-item dict
d1
s1 = {value} # one-item set
s1
- set and dict are not sequences
- No slicing with [m:n]
Conversions Among Low-Level Built-In Types
- Virtually all objects can be converted to a string via str
- Conversions among int, float, and bool work "within reason"
str(None)
str(12.354e8)
str(n2e)
int(4.567)
4
int(-4.567) # truncate, not floor!
-4
int('456')
456
int(True)
1
float(5)
5.0
float('5.432')
5.432
bool(4321)
True
bool(0.0)
False
bool('')
False
bool(' ')
True
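A few of the edges of "within reason" are sketched below: truncation versus floor for negative numbers, a string that int cannot parse directly, and non-empty strings that are True regardless of their text. The variable names are illustrative.

```python
import math

truncated = int(-4.567)         # int() drops the fractional part
floored = math.floor(-4.567)    # floor goes toward -infinity

# int() does not accept a string containing a decimal point
try:
    int('4.567')
    direct_ok = True
except ValueError:
    direct_ok = False

two_step = int(float('4.567'))  # convert to float first, then truncate

# any non-empty string is True, whatever it says
nonempty_zero = bool('0')
false_string = bool('False')

print(truncated, floored, direct_ok, two_step, nonempty_zero, false_string)
```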
A Convenient Web Scraping Module: BeautifulSoup
- To "scrape" a website
- Open a connection to the web page
- Download the source (text) of the web page
- Convert the source to an HTML-aware BeautifulSoup object
- Write the BeautifulSoup to a file, and examine the contents for HTML tags
- Extract the tagged information you want from the BeautifulSoup
Scraping the Yield Curve
from urllib.request import urlopen # b_soup1.py
from bs4 import BeautifulSoup
html = urlopen('')
bsyc = BeautifulSoup(html.read(), "lxml")
fout = open('bsyc_temp.txt', 'wt', encoding='utf-8')
fout.write(str(bsyc))
fout.close()
- Searching bsyc_temp.txt for 08/01/19, we are lucky to only find it once
<td class="text_view_data"
scope="row">08/01/19</td>
- The tag is td
- Googling HTML td tells us this is a cell in a table
- Searching backward for <table, we find the enclosing table tag, which has class="t-chart"
# print the first table
print(str(bsyc.table))
# ... not the one we want
# so get a list of all table tags
table_list = bsyc.findAll('table')
# how many are there?
print('there are', len(table_list), 'table tags')
# look at the first 50 chars of each table
for t in table_list:
    print(str(t)[:50])
# only one class="t-chart" table, so add that to findAll as a dictionary attribute
tc_table_list = bsyc.findAll('table', {"class": "t-chart"})
# how many are there?
print(len(tc_table_list), 't-chart table')
# only 1 t-chart table, so grab it
tc_table = tc_table_list[0]
# what are this table's components/children?
for c in tc_table.children:
    print(str(c)[:50])
# tag tr means table row, containing table data
# what are the children of those rows?
for c in tc_table.children:
    for r in c.children:
        print(str(r)[:50])
# we have found the table data!
# just get the contents of each cell
for c in tc_table.children:
    for r in c.children:
        print(r.contents)
Lecture 3: Construction and Comprehension, Exceptions, User Input, Functions, Modules, and Intro to Numpy
list, tuple, and set Construction
- list, tuple, and set objects can be constructed from iterables
tup1 = tuple('this is a test')
tup1
('t', 'h',....)
s1 = set(tup1)
s1
{'t', 'h',....}
ls1 = list(s1)
ls1.sort()
ls1
[' ', 'a', 'e', 'h', 'i', 's', 't']
dict Construction
- A dict object can be constructed from an iterable of 2-tuples
set_of_2tups = {('a', 12), ('b', 22)}
set_of_2tups
{('b', 22), ('a', 12)}
d1 = dict(set_of_2tups)
d1
{'b': 22, 'a': 12}
The zip() Function
- The zip() function zips two (or more) iterables together into an iterable of tuples
- Handy for creating a dict from two iterables
ls1
[' ', 'a', 'e', 'h', 'i', 's', 't']
d2 = dict(zip(ls1, range(len(ls1))))
d2
{' ': 0, 'a': 1, 'e': 2,..., 't': 6}
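Two details of zip() are worth seeing in action: it stops at the end of the shortest iterable, and it can combine more than two iterables. The names and values below are made up for the sketch.

```python
names = ['al', 'bob', 'cy']
ages = [31, 45, 27, 99]      # one extra value, which zip() ignores

pairs = list(zip(names, ages))
name_to_age = dict(zip(names, ages))

# three iterables produce 3-tuples
triples = list(zip('abc', range(3), [True, False, True]))

print(pairs)
print(name_to_age)
print(triples)
```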
Comprehensions
- A comprehension is a concise way of building a list, tuple, set, or dict
- It is "Pythonic", meaning cool
- Comprehensions can be clear... or very obscure
A list Comprehensions
- The brute force way to create a list of int values from 0 through 15:
m20 = [0, 1, 2, 3, 4, 5, 6,..., 15]
A list From a for Loop
- An easier and less error-prone way
m20 = [] # start empty
for v in range(16):
    m20.append(v)
A list Comprehension
- A list comprehension puts the for loop inside the list
m20 = [v for v in range(16)] # 0 through 15
m21 = [0 for v in range(8)]
m22 = [v**2 for v in range(8)]
m23 = [v/2 for v in range(10) if v % 2 == 1]
import math as m
m24 = [m.cos(m.pi * v/4) for v in range(8)]
m25 = [(v**(1/3), v**.5, v, v**2, v**3) for v in range(9)] # a list of tuples
[expr for var(s) in iter [for_or_if...]] # in general
set Comprehension
- A set comprehension is like a list comprehension
- Use {} rather than []
s20 = {v..... for loop...}
dict Comprehension
- A dict comprehension can also use {}
- Items must be specified with key: value notation
d20 = {k: k**2 for k in range(8)}
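A concrete sketch of set and dict comprehensions, including an if filter. The sample text is the string used earlier in the lecture; the variable names are illustrative.

```python
text = 'this is a test'

# set comprehension: the distinct non-space characters
letters = {c for c in text if c != ' '}

# dict comprehension: each distinct letter mapped to its number of occurrences
counts = {c: text.count(c) for c in letters}

# dict comprehension with a filter: squares of the odd numbers below 8
odd_squares = {k: k**2 for k in range(8) if k % 2 == 1}

print(sorted(letters))
print(counts)
print(odd_squares)
```

A set has no defined order, so the letters are sorted before printing to make the display predictable.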
Exception
- Many program errors raise exceptions
7 / 0 # ZeroDivisionError
d = dict(5) # TypeError
x = float('12.34.56') # ValueError
fin = open('/foo/asdf', 'rt') # FileNotFoundError
try... except
- You can try a block of statements
- If an exception occurs, use except to capture and handle the exception
- except in this form handles any kind of exception
try:
    val = 7/0
except: # handle any exception
    val = -1.0
val
-1.0
except FileNotFoundError: # a specific kind of error
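Naming specific exception types lets each kind of error be handled differently, with a more general handler as a fallback. The sketch below is one way to arrange this; the helper describe_failure is a hypothetical name, and the file path is assumed not to exist.

```python
def describe_failure(func):
    """Call func() and report which kind of exception, if any, it raised (hypothetical helper)."""
    try:
        func()
    except ZeroDivisionError as err:       # one specific kind of error
        return 'zero division: ' + str(err)
    except (TypeError, ValueError) as err: # either of two kinds
        return type(err).__name__
    except Exception as err:               # anything else
        return 'other: ' + type(err).__name__
    else:                                  # runs only if no exception occurred
        return 'ok'


r1 = describe_failure(lambda: 7 / 0)
r2 = describe_failure(lambda: dict(5))
r3 = describe_failure(lambda: float('abc'))
r4 = describe_failure(lambda: open('/no/such/dir/file.txt', 'rt'))
r5 = describe_failure(lambda: 7 / 2)

print(r1, r2, r3, r4, r5, sep='\n')
```

The handlers are tried in order, so the most specific exception types should come first.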
Handling User Input
- Users make all kinds of errors
- Use input(prompt) to read user input as a string
- Convert to desired type: int, float,...
- Use try...except to deal with formatting errors
- Use normal logic to deal with range errors
answer = input('Please enter your name: ')
age = input('Please enter your age: ')
age_val = float(age)

age_bad = True # user_age.py
age = 0.0
while age_bad:
    try:
        age_str = input("Enter your age: ")
        age = float(age_str)
    except:
        print("Bad age format")
        age = -1.0
    if not 0.0 < age <= 125.0:
        print("Enter value in (0.0, 125.0]")
    else:
        age_bad = False
print("Age is", age)
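The same two checks (format via try...except, range via normal logic) can be separated from the input() call, which makes them easy to try on sample strings. This is a sketch; the helper parse_age and its return convention are assumptions, not part of the lecture's user_age.py.

```python
def parse_age(text):
    """Return (age, error): age as a float or None, error as a message or None (hypothetical helper)."""
    try:
        age = float(text)                  # formatting errors raise ValueError
    except ValueError:
        return None, 'Bad age format'
    if not 0.0 < age <= 125.0:             # range errors use normal logic
        return None, 'Enter value in (0.0, 125.0]'
    return age, None


for sample in ['42', 'abc', '-3', '125', '0']:
    print(repr(sample), parse_age(sample))
```

In an interactive program, a while loop would call input() and then parse_age() until the error part of the result is None.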
Defining and Calling Functions
- A function definition has this form, in which
- p1, p2,..., are optional positional parameters
- n1=v1,n2=v2,... are optional so-called keyword parameters and their corresponding default values
def say_hi():
    print("hi")

def func_name(p1, p2, ..., n1=v1, n2=v2, ...):
    stmt

def ret_pow(x, y=2): # positional parameter x vs. keyword parameter y
    return x ** y
Variadic Functions
- A variadic function is a function that can be called with a varying number of arguments.
- A function can be defined to receive a varying number of positional arguments via the notation *args after the required positional arguments
- Actually, any identifier can be used: args is conventional
Variadic Positional Arguments
- Within the body, args is a tuple of all trailing argument values
def very_fun(a, b, c, *args):
    print(a, b, c, args) # args is an empty tuple if no extra arguments are passed
very_fun(1, 2, 3, 4, 5, 6, 7, 8, 9) # 4 through 9 are collected in the tuple
def mysum(*args):
    print(args)
    total = 0 # avoid the name sum, which would hide the built-in function
    for v in args:
        total += v
    return total
mysum(1, 2, 3, 4, 5, 6)
Variadic Keyword (Named) Arguments
- A function can be defined to receive a varying number of keyword (named) arguments via the notation **kwargs
- The name **kwargs is conventional
- **kwargs must come after positional and other keyword arguments, if any
- Within the body, kwargs is a dict of all trailing keyword arguments and values
def vf2(a, *args, b=42, **kwargs): # kwargs is an empty dict if no extra keyword arguments are passed
    print(a, args, b, kwargs)
vf2(1, 2, 3, 4, b=5, arg=6, ht=70, nm='Joe')
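To see how arguments are distributed among a, args, b, and kwargs, the sketch below returns them instead of printing. It also shows the reverse direction, unpacking a list and a dict into a call with * and **, which is an addition to the lecture material. The names nums and opts are illustrative.

```python
def vf2(a, *args, b=42, **kwargs):
    # return the pieces so they can be inspected
    return a, args, b, kwargs


r_min = vf2(1)                                         # nothing extra passed
r_full = vf2(1, 2, 3, 4, b=5, arg=6, ht=70, nm='Joe')  # extras are collected

# * unpacks a sequence into positional arguments,
# ** unpacks a dict into keyword arguments
nums = [1, 2, 3]
opts = {'b': 0, 'nm': 'Ann'}
r_unpacked = vf2(*nums, **opts)

print(r_min)
print(r_full)
print(r_unpacked)
```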
About Modules
- A module file contains Python code (file.py)
- A package is a hierarchically structured collection of related modules - beyond our scope
- When you run a module in IDLE or some other Python IDE, that is the main module
- It may import and use all or parts of other modules
- An interactive Python shell considers itself the main module
# this is my test module, testmod.py
import math as m # import other modules
print('I am the testmod.py module')
print('The value of pi is (approximately):', m.pi)
Module Contents
- A module may define
- Variables (like pi or e)
- Functions (like sqrt or cos)
- Classes (like BinaryTree)
- The name of the module is simply the name of the code file, with the .py removed
- mystuff.py contains the mystuff module
The __name__ Variable
- Within any module, the variable __name__ is set to the name of the module
- The name of the main module is '__main__'
- In the interactive shell
__name__
vars()
Module Test Code
- For development and testing, code that "just runs" can be placed near the end of the module, like so:
# mymodule.py
def fun1():
    ...
c_num = '95888'
if __name__ == '__main__':
    fun1() # test call of fun1
    print(c_num) # display c_num value
- If mymodule.py is run from within IDLE or another IDE, the code following if __name__ == '__main__': will be executed
- But if some other module does import mymodule, the test code will not be executed, because __name__ == 'mymodule'
Example: mymath.py
# mymath.py
def sqrt(x):
    return x ** .5
def cube(x):
    return x ** 3
def mysum(*args):
    x = 0
    for v in args:
        x += v
    return x
if __name__ == '__main__':
    print('module name is: ', __name__)
    print('square root of 3: ', sqrt(3))
    print('sum of 1, 2, 4, 8, 6, 9 is: ', mysum(1, 2, 4, 8, 6, 9))
else:
    print('imported module name is: ', __name__)

# myprog.py
import mymath as mm
x = 123
if __name__ == '__main__':
    print('module name is: ', __name__)
    print(x, 'cubed is', mm.cube(x))
    print('square root of', x, 'is', mm.sqrt(x))
Good Enough For Us
- You can be much more sophisticated than this in organizing your code, via:
- Environment variable settings
- Configuration files
- And/or IDE configuration settings
- Details are system-specific, IDE-specific
- What we have shown will be good enough for us
NumPy, Pandas, SciPy, and Statsmodels
- Powerful and popular data analysis modules
- NumPy: ndarray n-dimensional arrays, related math functions, linear algebra,...
- Pandas: Series indexed data series and DataFrame "spreadsheet" like facilities
- SciPy: efficient numerical routines: integration, cubic spline, optimization,...
- Statsmodels: statistical models, tests, data exploration, using DataFrames
- We will follow an import naming convention common in Python examples and documentation:
import numpy as np
import pandas as pd
import scipy as sp
import statsmodels.api as sm
import matplotlib.pyplot as plt
NumPy ndarray
- A one-dimensional ndarray is "like" an optimized list for vector operations
- Data in contiguous memory
- Vectorized computation algorithms in C
- Individual item access with [] is the same for ndarray as for list
- An ndarray does not store Python int values
- Stores efficient but more restrictive C/C++ style fixed-width integers (32-bit or 64-bit, depending on the platform)
- An ndarray is iterable, so conversion to basic collection types is easy
- To create an ndarray from a numeric iterable, use np.array(iterable)
- Unlike a list, every element of an ndarray will be of the same type
- Upcasting converts elements to the minimum type able to hold all objects
ls1 = list(range(5))
ls1
[0, 1, 2, 3, 4]
a1 = np.arange(5) # 'array-range'
a1
array([0, 1, 2, 3, 4])
a1[2]
2
a1[-1]
4
list(a1)
[0, 1, 2, 3, 4]
tuple(a1)
(0, 1, 2, 3, 4)
ndarray Arithmetic: Vector vs. List Operations
### list operations:
ls1 *= 2 # repetition: ls1 concatenated with itself
ls1
[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
ls1 += 1 # undefined: raises TypeError
ls1 + ls1
[0, 1, 2, 3, 4, ....., 0, 1, 2, 3, 4]
### ndarray vector operations:
a1 *= 2 # scalar multiplication
a1
array([0, 2, 4, 6, 8])
a1 += 1 # add 1 to each element
a1
ndarray Slices
- A slice of an ndarray is a view on part of the ndarray
a2 = np.arange(10)
a2[2:6]
array([2,3,4,5])
a2[2:6] = -8
a2
array([0,1,-8,-8,-8,-8,6,7,8,9])
ndarray copy
- Must use copy() to get an independent copy of an ndarray or ndarray slice
a4 = a2[:5].copy()
N-dimensional ndarrays
- One way to create an N-dimensional ndarray is from a list of lists, or a tuple of tuples
a1 = np.array([[1,2,3,4], [6,4,2,0]])
a1.ndim
2 # 2-dimensional
a1.shape
(2, 4) # tuple: 2 rows, 4 columns
- Or, reshape() an existing ndarray
- Returns a new array object with the requested shape (a view of the same data when possible): no change to the shape of the original array
a2 = np.arange(12)
a3 = a2.reshape(3, 4)
- Or, call a function that creates an ndarray of some shape, where shape is tuple
np.ones(shape) # ndarray of all 1.0s
np.zeros(shape) # ndarray of all 0.0s
np.full(shape, val) # ndarray of all val
np.eye(N) # N x N identity matrix
np.identity(N) # N x N identity matrix
2-dimensional ndarray Slices
a1[1:, :3] # rows starting from 1; columns up to but not including 3
a1[2] # 3rd row
a1[2].shape
a1[2].ndim # a 1-D array
a1[2][1] = 5 # same as a1[2,1] = 5
a1[2, :] # 3rd row, all columns
a1
array([[0,2,4,6,8], [-3,-9,-15,7,9], [15,5,9,-2,-1]])
a1[2:]
array([[15,5,9,-2,-1]])
a1[2:].shape
(1,5)
a1[2:].ndim
2 # 2-D array
a1[1,2] # item in row 1, col 2
a1[1:2, 2] # rows 1 through <2, col 2
Boolean Indexes
- You can use a list or ndarray of Booleans to select a subset of an ndarray
- Applying an equality, inequality, or relational operator to an ndarray yields a Boolean ndarray of the same shape
- For indexing purposes Boolean values are treated as binary
- Use & for and, | for or, ^ for xor (one and only one is true), ~ for invert
- These operators have higher precedence than the equality and relational operators, so use (...)
brows = [True, False, True]
a1[brows]
a1[:, [True, False, False, True, False]]
a1[a1 < -5] = 8
a1[(a1 > 5) & (a1 < 9)] -= 5
Integer (or "Fancy") Indexes
- You can use a list or ndarray of integers to select a subset of an ndarray
- Can change the order of rows or columns, or create duplicates
- From a 2-D ndarray, you can use two integer index lists to create a 1-D ndarray of selected values
- The two index lists must be the same length
a1[[2,0]] # row 2 followed by row 0
a1[:, [3,2,1,2]] # all rows, cols 3,2,1,2
a1[[0,1,0,2], [0,4,2,0]] # [0,0], [1,4], [0,2], [2,0]
More NumPy
- NumPy offers many other facilities/methods
- Vectorized array methods: fabs, sqrt, exp, log, ceil, floor, sin, arcsin,...
- Statistical methods: mean, sum, cumsum, std, var, min, max,...
- Random number generators for many distributions
- Linear algebra calculations
- Sorting and set operations
- File input and output
Lecture 4: Intro to Pandas
Pandas: Series
- A Series is a one-dimensional sequence of values, together with a same-length sequence of labels: the index of the Series
- By default, the index values are 0 through N-1
- Index does not have to be int!
s1 = pd.Series([3,5,2,4])
s1
0    3
1    5
2    2
3    4
dtype: int64
s1[2]
s1.values
array([3,5,2,4], dtype=int64) # ndarray
s1.index
RangeIndex(start=0, stop=4, step=1) # like range
s2 = pd.Series([4,2,1,5], index=['a', 'x', 'C', '?'])
Series and dict
- It's easy to construct a Series from a dict
- Index will be from keys, values from values
d1 = {'X': 123.2, 'AAPL': 543, 'CSCO': 47.7}
s3 = pd.Series(d1) # dict: ticker, price
s3
X       123.2
AAPL    543.0
CSCO     47.7
dtype: float64
Series Index Check
- You can use dict-style notation to check whether a Series does or does not have a specific index
'AAPL' in s3 True
Series Arithmetic, Slicing, and Indexing
s1 *= 2 # times 2
s4 = s1 / 2 # float64
s1 //= 2 # int64
s1 + s1 # int64
np.sqrt(s1) # float64
- If indexes differ between two Series, arithmetic will yield NaN (missing value or not applicable)
- pd.isnull(Series) and pd.notnull(Series) yield Boolean indexes of NaN and non-NaN items, respectively
pd.isnull(s1p6)
Pandas: DataFrame
- A DataFrame is "like" a spreadsheet, with named columns and N rows
- Idea borrowed from R data.frame
- Easy to build from a dict, with column names as keys and equal-length lists as values in rows
f1 = pd.DataFrame(d2)
DataFrame Column Access
- One way to retrieve a column, as a Series:
f1['CEO_buy']
s1.name = 'I am s1'
- One way to retrieve a row, as a Series:
f1.loc[2] # the 3rd row
f1.loc[2][0] # the 3rd row, 1st column
- Three ways to retrieve a cell:
f1['CEO_buy'][2]
f1.loc[2]['CEO_buy']
f1.loc[2, 'CEO_buy']
DataFrame: Add a Column
- Add a column to a DataFrame much as you would add a key: value pair to a dict
f1['P/E'] = 20.5
f1['PEG_ratio'] = [17, 0.9, 2.2, 1.1]
DataFrame: Delete a Column
del f1['P/E']
DataFrame: Add a Row
- If the data types of the new row don't match the existing column types, the column types will be upcast!
f1.loc[4] = ['z', True, 8.2, 1.3]
DataFrame: Delete a Row
- Deleting the problem row will not "deupcast" - beware!
fic.drop(0.5) # specify the index to drop; drop() creates a copy and does not change fic
DataFrame Slicing and Indexing
- Slicing and indexing works for DataFrames much as for Series
a12.reshape(4,3)
f2 = pd.DataFrame(a12.reshape(4,3), index=['r0', 'r1', 'r2', 'r3'], columns=['c0', 'c1', 'c2'])
f2['c0']
f2.loc['r1']
f2.loc['r0':'r2'] # up to and including r2
f2[0:2] # up to and not including index 2
DataFrame loc and iloc
- loc uses row index and column labels
- iloc uses integer row and column positions
f2.loc['r1']
f2.loc['r0':'r1'] # r0 through and including r1
f2.loc[['r3', 'r1']] # r3 and r1
f2.iloc[2]
f2.iloc[1:3] # sub-1 through but not including sub-3
f2.iloc[[3,2]] # sub-3 and sub-2
f2.loc['r1', 'c2']
f2.loc['r0':'r2', ['c2', 'c0']]
f2.iloc[[0, 3], [2, 0]]
Arithmetic with Fill Values
- We have seen that arithmetic combining DataFrames with unlike columns/rows introduces missing values
- DataFrame named arithmetic functions allow you to specify a fill_value argument
- The fill value is used in place of a missing value in one DataFrame or the other
- But not if the value is missing in both DataFrames!
f3 = f2.copy()
f3['c3'] = [1, 3, 5, 7]
f3 = f3.drop('r1')
f2 + f3
f2.add(f3, fill_value=0)
Sorting
- To sort DataFrame rows, use sort_index()
- For columns, sort_index(axis=1)
f4 = pd.DataFrame(np.array(f2), index=['a', 'd', 'b', 'c'], columns=['z', 'x', 'y'])
f4.sort_index() # sort rows
f4.sort_index(axis=1) # sort cols
f4.sort_index().sort_index(axis=1) # sort both rows and cols
f4.sort_index(ascending=False).sort_index(axis=1) # rows descending, cols ascending
DataFrame Summary Statistics
- DataFrame provides "the usual suspects" for summary statistics
np.random.seed(1)
a5 = np.random.randn(3,3).round(3)
f5 = pd.DataFrame(a5, index=['r0', 'r1', 'r2'], columns=['c0', 'c1', 'c2'])
f5.sum()
f5.mean()
f5.var()
f5.std()
f5.describe()
idxmin(), idxmax() # row of min, max for each col
median()
prod() # product of row vals by col
skew(), kurt() # skew, kurtosis
cumsum(), cumprod()
diff()
pct_change() # % change row over row
Lecture 5: Regular Expressions, Reading/Writing Formatted Data, String Methods
About Regular Expressions
- Regular expressions are a compact, mathematical-style notation for matching patterns in text
- Originally developed in the context of human language parsing
- Then programming language grammars, parsers
- Python provides an extensive collection of regular expression special characters and notations
Basic Regular Expression Characters
c Matches the character c, unless c is a regular expression special character
^ Matches the start of a string
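The two entries above can be tried with the standard re module. This is a minimal sketch: re.search returns a match object if the pattern is found anywhere in the string, or None otherwise. The last line, which uses a backslash to match a special character literally, goes slightly beyond the entries listed so far.

```python
import re

m1 = re.search('e', 'hello')      # an ordinary character matches itself
m2 = re.search('^he', 'hello')    # ^ anchors the match at the start of the string
m3 = re.search('^el', 'hello')    # 'el' occurs, but not at the start
m4 = re.search(r'\^', 'a^b')      # a backslash makes ^ match a literal caret

print(m1.start(), m2.group(), m3, m4.start())
```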