Data Focused Python

Lecture 1

Variable Names

A legal variable name consists of
- A letter or an underscore (_)
- Followed by 0 or more letters and/or decimal digits (0-9) and/or underscores

Low-Level Scalar Types

int
float
str
bytes
bool
None

Arithmetic Operators

** # exponentiation
+ - # unary plus, minus
* / // % # multiply, divide, "floor" divide (truncate toward -infinity), modulus (remainder)
+ - # binary plus, minus
# use (...) for grouping

"hello"
'hello'
"hello" + 'hello'
"'hello'" + '"hello"'
"0123456789" * 3 # repetition
'''hello'''
"""hello""" # multiple line quotes

True
False

Arithmetic Operator Associativity - all binary arithmetic operators associate left-to-right, except **, which associates right-to-left

2 ** 3 ** 2 = 2 ** (3 ** 2)
9 / 3 * 2 = (9 / 3) * 2
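
A small check of the associativity rules above, plus the floor-divide and modulus behavior for a negative operand. The variable names are only illustrative.

```python
# ** associates right-to-left; the other binary operators associate left-to-right.
right_assoc = 2 ** 3 ** 2       # parsed as 2 ** (3 ** 2), i.e. 2 ** 9
left_grouped = (2 ** 3) ** 2    # explicit grouping gives 8 ** 2
left_assoc = 9 / 3 * 2          # parsed as (9 / 3) * 2; / always yields a float

# Floor division rounds toward -infinity, and % takes the sign of the divisor.
floor_neg = -7 // 2
mod_neg = -7 % 2

print(right_assoc, left_grouped, left_assoc, floor_neg, mod_neg)
```

Note that floor_neg and mod_neg satisfy (-7 // 2) * 2 + (-7 % 2) == -7.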

a = 2
b = 4
a **= b # a = a ** b
a -= 5 # a = a - 5

Collection Built-In Types

list
- Items are indexed from 0 to n-1
- Or in reverse from -1 to -n

m = [7, 2 , 3 ,0]

Items of a list can be modified:

[7, 2, 3, 0]
m[2] = 5
[7, 2, 5, 0]

"+" and "*" work as with str objects:

m + [4, 7, 1]
[7, 2, 5, 0, 4, 7, 1]
[0] * 5
[0, 0, 0, 0, 0]

A list supports many named operations, including:

append(val) # "like" push_back, always adds at the end
insert(idx, val) # insert ahead of "idx"
remove(val) # first match of val
pop([idx]) # return item at idx, or at n-1 if no idx, and remove that item
count(val)
sort() # ascending by default
reverse()
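
A short walk-through of the named operations listed above, applied one after another to the same list. The values are arbitrary; the comments show the list after each call.

```python
m = [7, 2, 5, 0]
m.append(4)           # [7, 2, 5, 0, 4]
m.insert(1, 9)        # [7, 9, 2, 5, 0, 4]  (9 goes ahead of index 1)
m.remove(5)           # [7, 9, 2, 0, 4]     (first match of 5 is removed)
last = m.pop()        # returns 4, leaving [7, 9, 2, 0]
first = m.pop(0)      # returns 7, leaving [9, 2, 0]
n_zeros = m.count(0)  # number of items equal to 0
m.sort()              # [0, 2, 9]
m.reverse()           # [9, 2, 0]
print(m, last, first, n_zeros)
```

All of these except count() modify the list in place; only pop() and count() return a useful value.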

Logical and Physical Lines

A statement must be contained on a single logical line
A physical line ends with NEWLINE
It is a syntax error to split a logical line across multiple physical lines
unless the physical line ends with \
or the code is within (), [], or {}

x = \
12

x = (
12
)

Multiple Statements on One Physical Line

Use ; between statements on a physical line

a = 4; b = 12.5; c = 'x'

Length: len

Use "len" to get the number of items in any collection (including a str)

len([3, 2, 4, 4])
4

len('hello')
5

Sequence Slices

"list" and "str" are sequence types
A slice of a sequence is obtained with seq[i:j:k] # from item i, to not including item j, with step size k
In seq[i:j:k], missing :k implies :1
missing i implies 0
missing j implies len(seq)

ages = [3, 12, 5, 33, 68]
ages[0:3:1]
[3, 12, 5]
ages[0:2]
[3, 12]
ages[:4]
[3, 12, 5, 33]
ages[1:]
[12, 5, 33, 68]

a = [0, 1, 2, 3, 4, 5, 6, 7]
b = a[4:] + a[1:4]
b
[4, 5, 6, 7, 1, 2, 3]

'hello'[:3] + 'p!'
'help!'

s1 = 'international'
s1[::2]
'itrainl'

't' + s1[:8:2] + s1[7:11]
'titration'

Object Identity: id

Almost anything in Python is an object
An int, a float, a bool, a str, None, a list, a tuple, a set, ...
A function, a class, an iterator,...
Each object is uniquely identified by its id
The id may or may not be the memory address, depending on the Python implementation
Collections can have equal items without having the same ids

id(n)

x = 7
id(x)
y = 7
id(y) # same id as "x" and "7" (CPython caches small int objects)

m= [1,2,3]
id(m)
n= [1,2,3]
id(n) # different id from m

n = m
id(n) # now the same

Equality, Relational and Logical Operators

From high precedence to low precedence
- Notice keywords rather than operator symbols
- Not the same precedence as in Java, C, C++
Use (...) for grouping
Unlike C/C++/Java, equality and relational operators may be chained with the expected mathematical meaning, e.g.:
- a > b > c means a > b and b > c
Evaluation here is left to right
- If a > b is False, then b > c is not evaluated

== != < <= > >= is is not
not
and
or
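
A sketch that makes the chaining and precedence rules observable. The helper noisy() is hypothetical; it only records which operands were actually evaluated.

```python
evaluated = []

def noisy(value):
    """Record that value was evaluated, then return it unchanged."""
    evaluated.append(value)
    return value

# Chained comparison: a > b > c means (a > b) and (b > c).
# Since 1 > 5 is False, the third operand is never evaluated.
chained = noisy(1) > noisy(5) > noisy(3)

# Precedence: not binds tighter than and, which binds tighter than or.
p1 = not True or True           # (not True) or True
p2 = True or False and False    # True or (False and False)

print(chained, evaluated, p1, p2)
```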

Identity vs. Value Comparison

"is" and "is not" compare object ids
== and != compare object values
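
A minimal illustration of the difference, using two lists with equal items and an alias for one of them. The names are illustrative only.

```python
m = [1, 2, 3]
n = [1, 2, 3]       # a separate list object with equal items
alias = m           # a second name for the same object as m

same_value = (m == n)       # compares item values
same_object = (m is n)      # compares object ids
alias_is_m = (alias is m)

alias.append(4)             # changing the object through one name...
print(m, same_value, same_object, alias_is_m)   # ...is visible through the other
```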

The if Decision, and Indentation

The Python 3 if decision is of the form

if bool_expr:
    statement1
    ... # optional statements
blank line # this concludes the if

Indentation of statement1 is required
- Additional statements, if any, must be indented by exactly the same amount
Even indenting the first character in a top-level line is an error
Line and Indentation Rules: Good or bad?
Good:
- Forces a common indentation scheme for all programmers
- No curly braces to keep track of
Not so good:
- Line continuation is clunky
- In long code, it is hard to see both the beginning and the end of an if on the same screen

Using the IDLE Editor

Click File / New File to create a new file
Later, click Run / Run Module to execute the code

The print Function

In the interactive shell, the value of a typed expression (other than an assignment) is automatically displayed
In code in a file, you can use the print function to display output
- By default, displays values separated with spaces

print(a, 'Bob', 11/3, True)
hello Bob 3.6666666666666665 True

Keyword Arguments

Many functions provide keyword arguments
- Of the form name=value
- Must follow all of the positional arguments

a = 'hello'
print(a, 'Bob', 11/3, True, sep=',') # change the separator (keyword argument)
hello,Bob,3.6666666666666665,True

General if/elif/else Decisions

A decision must start with one if part
- Optionally followed by zero or more elif parts
- Optionally followed by one else part

if bool_expr:
    stmt
elif bool_expr:
    stmt
else:
    stmt

if a == 'hello':
    print(a, 'is equal to \'hello\'')
elif b < c:
    print(b, 'is less than', c)
else:
    print('None of the above!')

General while Loops

A while loop is straightforward

while bool_expr:
    stmt

i = 0
while i < 10:
    print(i, end=' ') # a space will be displayed instead of a new line
    i += 1

Iterating with for Loops

A for loop steps through each item in an iterable, such as a sequence object

for var in iterable:
    stmt

for i in [1, 5, 9, -4, 12]:
    print(i ** 2)

for c in 'hello':
    print(c)

The range Function

The range function provides a useful iterable

for var in range(N): # 0, 1, 2, ..., N-1
    stmt

for var in range(M,N): # M, M+1, M+2, ..., N-1
    stmt

for var in range(M,N,S): # M, M+S, M+2S, ..., <N
    stmt

for var in range(5):
    print(var) # 0, 1, 2, 3, 4

for var in range(4,-1,-1):
    print(var) # 4, 3, 2, 1, 0

Modules

Python uses modules of code for extended capabilities
- Modules, or module items, can be imported into your code
- For example, the math module contains many common mathematical functions and values

The math Module

import math
math.exp(1)
import math as m
m.exp(1)
m.cos(0)
m.sqrt(2)
m.log(1234)
from math import exp
exp(1) # if you just need exp()
from math import * # very risky! everything
exp(sqrt(sin(pi))) # all math available???
pi
sin(pi)
sqrt(sin(pi))
exp(sqrt(sin(pi)))

Copying a Text File

Quick and dirty: we will do better than this

fin = open('/dir1/dir2/in.txt', 'rt', encoding='utf-8') # input file, read text
fout = open('/dir1/dir2/out.txt', 'wt', encoding='utf-8') # output file, created or overwritten
for line in fin: # line is a str
    fout.write(line) # write the line into the output file
fin.close() # less important to close input file
fout.close() # must close output file
# \ at the end of a physical line continues the statement on the next line
# / is the directory name separator
# 'rt' - read text; 'wt' - write text
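
One possible way to do better than the quick-and-dirty version is the with statement, which is not covered in these notes. It closes both files automatically, even if an exception occurs part-way through. The function name copy_text and the temporary file names below are assumptions made only for this sketch.

```python
import os
import tempfile

def copy_text(src_path, dst_path):
    """Copy a UTF-8 text file line by line; both files are closed automatically."""
    with open(src_path, 'rt', encoding='utf-8') as fin, \
         open(dst_path, 'wt', encoding='utf-8') as fout:
        for line in fin:        # line is a str, including its trailing newline
            fout.write(line)

# Demonstrate on a throwaway file in a temporary directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'in.txt')
dst = os.path.join(workdir, 'out.txt')
with open(src, 'wt', encoding='utf-8') as f:
    f.write('first line\nsecond line\n')

copy_text(src, dst)
with open(dst, 'rt', encoding='utf-8') as f:
    copied = f.read()
print(copied)
```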

Lecture 2: More Collections, Type Conversion, and Web Scraping

Collection Built-In Types

Python 3.7 provides

list # "like" an array
tuple # "like" a record
set # just one of each value
frozenset # (we won't care)

A str can also be used as a collection

tuple Example

A tuple is enclosed in (...)
- Items are indexed from 0 to n-1
  - Use [idx] to access an item
- Or in reverse from -1 to -n

n = ('a', 'golden', 'eagle', 'soars')
type(n)
n
n[2]

A tuple is immutable
- tuple items cannot be modified
- But a variable can be changed to refer to a different tuple

n = (n[0], 'bald', n[2], n[3])
n

To assign tuple items to separate variables
- Called sequence unpacking

a,b,c,d = n
c

You can construct a tuple without parens
- Called tuple packing

t = 1, 4, 'hi', 4.6
t

Multiple Assignment

Multiple assignment combines tuple packing and sequence unpacking

a, b, c, d = 'hi', 12.6, True, 9

To swap the value of two variables

a = -7
b = 3
b, a = a, b
print (a, b)

Empty and One-Item tuples

An empty tuple can be represented as ()

t = ()
type(t)
t

But parens are also used for grouping
- (value) is simply value

t = (6)
t

The goofy (value,) creates a one-item tuple

t = (6,) # otherwise it's an int
t
len(t)

tuple Slices and Concatenation

Like a list or str, a tuple is a sequence, supporting slices
- Another way to construct a tuple from an existing tuple:

n
n = n[:1] + ('harpy',) + n[2:] # items before index 1, then the new item, then items from index 2 on
n

set Example

A set is enclosed in {...}
- Items are unordered (although they may appear sorted when displayed), with no duplicates

s = { 1, 6, 5, 9, 2, 1, 6, 3}
type(s)
s

set Named Operations

set provides many named operations for manipulating items, including

add(val) # add val to set
discard(val) # remove val from set, if val is a member
remove(val) # remove val, and fail if val is not a member
pop() # remove and return an arbitrary item
clear() #  remove all items
val in s # True if val is in s, otherwise False

s
s.add(7)
s.add(3) # 3 is already in set, therefore is not added
s.discard(7) # 7 is gone
s.discard(13) # 13 is not there
s.remove(2) # 2 is removed
s.remove(13) # fails as there's no 13
s.pop()
s.add(1)
s.add(3)
s.add(13)
s.add(6) # 6 is already there, so it is ignored

set vs set Operations

set provides "the usual suspects" of set vs. set operations (methods)

s1.difference(s2) # s1 - s2
s1.symmetric_difference(s2) # union of (s1 - s2) and (s2 - s1)
s1.intersection(s2) # items in both s1 and s2
s1.isdisjoint(s2) # True if s1 and s2 have no items in common
s1.issubset(s2) # s1 is subset of s2, True or False
s1.issuperset(s2) # s2 is subset of s1, True or False
s1.union(s2) # items in s1 or s2 (or both)
s1 - s2 # same as s1.difference(s2)

set Symbolic Operations

set provides some symbolic (borrowed from C/C++) operations that can be used rather than named operations

- # set difference
& # "and" (intersection)
^ # "xor" (symmetric difference)
| # "or" (union)
# from highest to lowest precedence
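
A quick check of the symbolic operators against two small, arbitrary sets. The last line shows the effect of precedence, since & binds more tightly than |.

```python
s1 = {1, 2, 3, 4}
s2 = {3, 4, 5}

diff = s1 - s2      # items in s1 but not in s2
inter = s1 & s2     # items in both
sym = s1 ^ s2       # items in exactly one of the two sets
uni = s1 | s2       # items in either

# Each symbolic operator matches its named method.
same_as_named = (diff == s1.difference(s2) and inter == s1.intersection(s2)
                 and sym == s1.symmetric_difference(s2) and uni == s1.union(s2))

mixed = s1 | s2 & {5}        # parsed as s1 | (s2 & {5})
grouped = (s1 | s2) & {5}    # explicit grouping changes the result
print(diff, inter, sym, uni, mixed, grouped)
```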

Remember: All Generic, All The Time

A set can contain any hashable items
- Scalar: int, float, bool, str, None
- tuples if all items are hashable
- Not lists or sets, since these are mutable

s3 = {5, 4.7, None, 'hello', True, (2,9,14)}
s3

The dict Collection Type

Python 3.7 also provides

dict # dictionary

A dict consists of key: value pairs, enclosed in {...}
- Keys must be hashable
- No restrictions on values

n2e = {'john': 'jkstlund@gmail.com',
        'al': 'al@alcorp.net',
        'bob': 'bob@bassoc.com'}
type(n2e)
n2e['cy'] = 'cy@nou.edu' # add one more item
# store in the order of key creation
n2e['john'] = 'jkstlund@andrew.cmu.edu' # the value is changed

dict Operations

dict provides many named operations for manipulating items, including

get(key[, def]) # return value for key, or def if key does not exist, or None if def is not provided
popitem() # remove and return some (key, value) pair as a tuple
pop(key[, def]) # remove (key, value) from the dict and return value; if key is not found, return def, or fail if def is not provided
clear() # remove all items
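
A sketch of these operations on a two-item dict. The keys and addresses are placeholders in the style of the n2e example.

```python
n2e = {'al': 'al@alcorp.net', 'bob': 'bob@bassoc.com'}

found = n2e.get('al')               # value for an existing key
missing = n2e.get('zed')            # None: no such key and no default given
fallback = n2e.get('zed', 'n/a')    # the default is returned instead

removed = n2e.pop('bob')            # returns the value and removes the pair
default = n2e.pop('bob', 'gone')    # key is now absent, so the default is returned
try:
    n2e.pop('bob')                  # absent key and no default: fails
    pop_failed = False
except KeyError:
    pop_failed = True

pair = n2e.popitem()                # removes and returns a (key, value) tuple
n2e['cy'] = 'cy@nou.edu'
n2e.clear()                         # the dict is now empty
print(found, missing, fallback, removed, default, pop_failed, pair, n2e)
```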

dict Iterables

dict provides three iterables that make it easy to loop through keys, values, or items

keys() # iterable over all dict keys
values() # iterable over all dict values
items() # iterable over all dict (key, value) tuples
n2e.keys()
n2e.values()
n2e.items()
for k in n2e.keys():
    print(k)
n2e['dave'] = 'dave@dave.org'
n2e
for j in n2e.keys():
    print(j)
for i in n2e.items():
    print(i)

Empty and One-Item sets and dicts

Both set and dict have items enclosed in {}

de = {} # empty dict
type(de)
de
se = set() # empty set
type(se)
se

d1 = {key: value} # one-item dict
d1
s1 = {value} # one-item set
s1

set and dict are not sequences
- No slicing with [m:n]

Conversions Among Low-Level Built-In Types

Virtually all objects can be converted to a string via str
Conversions among int, float, and bool work "within reason"

str(None)
str(12.354e8)
str(n2e)

int(4.567)
4
int(-4.567) # truncate, not floor!
-4
int('456')
456
int(True)
1

float(5)
5.0
float('5.432')
5.432
bool(4321)
True
bool(0.0)
False
bool('')
False
bool(' ')
True

A Convenient Web Scraping Module: BeautifulSoup

To "scrape" a website
- Open a connection to the web page
- Download the source (text) of the web page
- Convert the source to an HTML-aware BeautifulSoup object
- Write the BeautifulSoup to a file, and examine the contents for HTML tags: ...
- Extract the tagged information you want from the BeautifulSoup

Scraping the Yield Curve

from urllib.request import urlopen # b_soup1.py
from bs4 import BeautifulSoup
html = urlopen('')
bsyc = BeautifulSoup(html.read(), "lxml")
fout = open('bsyc_temp.txt', 'wt', encoding='utf-8')
fout.write(str(bsyc))
fout.close()

Searching bsyc_temp.txt for 08/01/19, we are lucky to only find it once

<td class="text_view_data"
scope="row">08/01/19</td>

The tag is td

Googling HTML td tells us this is a cell in a table

Searching backward for , we find

# print the first table
print(str(bsyc.table))
# ... not the one we want

# so get a list of all table tags
table_list = bsyc.findAll('table')

# how many are there?
print('there are', len(table_list), 'table tags')

# look at the first 50 chars of each table
for t in table_list:
    print(str(t)[:50])

# only one class="t-chart" table, so add that to findAll as a dictionary attribute
tc_table_list = bsyc.findAll('table', {"class": "t-chart"})

# how many are there?
print(len(tc_table_list), 't-chart table')

# only 1 t-chart table, so grab it
tc_table = tc_table_list[0]

# what are this table's components/children?
for c in tc_table.children:
    print(str(c)[:50])

# tag tr means table row, containing table data
# what are the children of those rows?
for c in tc_table.children:
    for r in c.children:
        print(str(r)[:50])

# we have found the table data!
# just get the contents of each cell
for c in tc_table.children:
    for r in c.children:
        print(r.contents)

Lecture 3: Construction and Comprehension, Exceptions, User Input, Functions, Modules, and Intro to Numpy

list, tuple, and set Construction

list, tuple, and set objects can be constructed from iterables

tup1 = tuple('this is a test')
tup1
('t', 'h',....)

s1 = set(tup1)
s1
{'t', 'h',....}

ls1 = list(s1)
ls1.sort()
ls1
[' ', 'a', 'e', 'h', 'i', 's', 't']

dict Construction

A dict object can be constructed from an iterable of 2-tuples

set_of_2tups = {('a', 12), ('b', 22)}
set_of_2tups
{('b', 22), ('a', 12)}
d1 = dict(set_of_2tups)
d1
{'b': 22, 'a': 12}

The zip() Function

The zip() function zips two (or more) iterables together into an iterable of tuples
- Handy for creating a dict from two iterables

ls1
[' ', 'a', 'e', 'h', 'i', 's', 't']
d2 = dict(zip(ls1, range(len(ls1))))
{' ': 0, 'a': 1, 'e': 2,..., 't': 6}

Comprehensions

A comprehension is a concise way of building a list, tuple, set, or dict
- It is "Pythonic", meaning cool
- Comprehensions can be clear... or very obscure

A list Comprehensions

The brute force way to create a list of int values from 0 through 15:

m20 = [0, 1, 2, 3, 4, 5, 6,..., 15]

A list From a for Loop

An easier and less error-prone way

m20 =[] # start empty
for v in range(16):
    m20.append(v)

A list Comprehension

A list comprehension puts the for loop inside the list

m20 = [v for v in range(16)] # 0 to 15
m21 = [0 for v in range(8)]
m22 = [v**2 for v in range(8)]
m23 = [v/2 for v in range(10) if v % 2 ==1]
import math as m
m24 = [m.cos(m.pi * v/4) for v in range(8)]
m25 = [(v**(1/3),v**.5, v, v**2, v**3) for v in range(9)] # a list of tuples

[expr for var(s) in iter [for_or_if...]] # in general

set Comprehension

A set comprehension is like a list comprehension
- Use {} rather than []

s20 = {v..... for loop...}
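
Two concrete set comprehensions, as a sketch. The expressions are made up for illustration and are not from the lecture. Because the result is a set, duplicate values collapse to one item.

```python
# Remainders of 0..7 when divided by 3: only three distinct values survive.
s20 = {v % 3 for v in range(8)}

# Distinct non-space characters of a string, using an if clause as a filter.
s21 = {c for c in 'this is a test' if c != ' '}

print(s20, s21)   # display order of set items is not guaranteed
```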

dict Comprehension

A dict comprehension can also use {}
- Items must be specified with key: value notation

d20 = {k: k**2 for k in range(8)}

Exception

Many program errors raise exceptions

7 / 0 # ZeroDivisionError
d = dict(5) # TypeError
x = float('12.34x56') # ValueError
fin = open('/foo/asdf', 'rt') # FileNotFoundError

try... except

You can try a block of statements
- If an exception occurs, use except to capture and handle the exception
  - except in this form handles any kind of exception

try:
    val = 7/0
except: # handle any exception
    val = -1.0
val
-1.0

except FileNotFoundError: # a specific kind of error
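
A sketch of handling specific exception types rather than everything. The helper names safe_divide and read_or_default are hypothetical, and the with statement used to open the file goes slightly beyond what these notes cover. Any exception not named in the except clause still propagates.

```python
import os
import tempfile

def safe_divide(num, den):
    """Return num / den, or -1.0 if den is zero."""
    try:
        return num / den
    except ZeroDivisionError:      # only division by zero is handled here
        return -1.0

def read_or_default(path, default=''):
    """Return the text of the file at path, or default if it does not exist."""
    try:
        with open(path, 'rt', encoding='utf-8') as fin:
            return fin.read()
    except FileNotFoundError:      # other errors (e.g. permissions) still propagate
        return default

# A path inside a fresh temporary directory, so the file cannot exist.
missing_path = os.path.join(tempfile.mkdtemp(), 'no_such_file.txt')
print(safe_divide(7, 0), safe_divide(7, 2), read_or_default(missing_path, 'n/a'))
```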

Handling User Input

Users make all kinds of errors
- Use input(prompt) to read user input as a string
- Convert to desired type: int, float,...
- Use try...except to deal with formatting errors
- Use normal logic to deal with range errors

answer = input('Please enter your name: ')
age = input ('Please enter your age: ')
age_val = float(age)

age_bad = True # user_age.py
age = 0.0
while age_bad:
    try:
        age_str = input("Enter your age: ")
        age = float(age_str)
    except:
        print("Bad age format")
        age = -1.0
    if not 0.0 < age <= 125.0:
        print("Enter value in (0.0, 125.0]")
    else:
        age_bad = False
print("Age is", age)

Defining and Calling Functions

A function definition has this form, in which
- p1, p2,..., are optional positional parameters
- n1=v1,n2=v2,... are optional so-called keyword parameters and their corresponding default values

def say_hi():
    print("hi")

def func_name(p1,p2,..., n1=v1, n2=v2,...):
    stmt

def ret_pow(x, y=2): # positional argument vs. keyword argument
    return x ** y

Variadic Functions

A variadic function is a function that can be called with a varying number of arguments.
- A function can be defined to receive a varying number of positional arguments via the notation *args after the required positional arguments
- Actually, any identifier can be used: args is conventional

Variadic Positional Arguments

Within the body, args is a tuple of all trailing argument values

def very_fun(a, b, c, *args):
    print(a, b, c, args) # args is a tuple; it is empty if no extra arguments are passed

very_fun(1, 2, 3, 4, 5, 6, 7, 8, 9) # 4 through 9 are collected into the tuple args

def mysum(*args):
    print(args)
    sum=0
    for v in args:
        sum += v
    return sum

mysum(1,2,3,4,5,6)

Variadic Keyword (Named) Arguments

A function can be defined to receive a varying number of keyword (named) arguments via the notation **kwargs
- The name **kwargs is conventional
- **kwargs must come after positional and other keyword arguments, if any
- Within the body, kwargs is a dict of all trailing keyword arguments and values

def vf2(a, *args, b=42, **kwargs): # kwargs is a dict; it is empty if no extra keyword arguments are passed
    print(a, args, b, kwargs)

vf2 (1,2,3,4,b=5,arg=6,ht=70,nm='Joe')
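
To see where each argument lands, here is a variant of vf2 that returns its parameters instead of printing them. That change is made only so the result can be inspected.

```python
def vf2(a, *args, b=42, **kwargs):
    """Return the parameters so the binding of each argument is visible."""
    return a, args, b, kwargs

minimal = vf2(1)    # nothing extra: args is (), b keeps its default, kwargs is {}
full = vf2(1, 2, 3, 4, b=5, arg=6, ht=70, nm='Joe')

print(minimal)
print(full)
```

In the second call, 2, 3, and 4 are collected into args, b=5 overrides the default, and the remaining keyword arguments are collected into kwargs.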

About Modules

A module file contains Python code (file.py)
- A package is a hierarchically structured collection of related modules - beyond our scope
When you run a module in IDLE or some other Python IDE, that is the main module
- It may import and use all or parts of other modules
An interactive Python shell considers itself the main module

# this is my test module, testmod.py
import math as m  # import other modules
print('I am the testmod.py module')
print('The value of pi is (approximately):', m.pi)

Module Contents

A module may define
- Variables (like pi or e)
- Functions (like sqrt or cos)
- Classes (like BinaryTree)
The name of the module is simply the name of the code file, with the .py removed
- mystuff.py contains the mystuff module

The __name__ Variable

Within any module, the variable __name__ is set to the name of the module
- The name of the main module is '__main__'
- In the interactive shell:

__name__
vars()

Module Test Code

For development and testing, code that "just runs" can be placed near the end of the module, like so:

# mymodule.py
def fun1():
    ...
c_num = '95888'
if __name__ == '__main__':
    fun1()      # test call of fun1
    print(c_num)    # display c_num value

If mymodule.py is run from within IDLE or another IDE, the code following if __name__ == '__main__': will be executed
But if some other module does import mymodule, the test code will not be executed, because __name__ == 'mymodule'

Example: mymath.py

# mymath.py
def sqrt(x):
    return x ** .5
def cube(x):
    return x ** 3
def mysum(*args):
    x = 0
    for v in args:
        x += v
    return x

if __name__ == '__main__':
    print('module name is: ', __name__)
    print('square root of 3: ', sqrt(3))
    print('sum of 1, 2, 4, 8, 6, 9 is: ',
          mysum(1, 2, 4, 8, 6, 9))
else:
    print('imported module name is: ', __name__)

# myprog.py
import mymath as mm
x = 123
if __name__ == '__main__':
    print('module name is: ', __name__)
    print(x, 'cubed is', mm.cube(x))
    print('square root of', x, 'is', mm.sqrt(x))

Good Enough For Us

You can be much more sophisticated than this in organizing your code, via:
- Environment variable settings
- Configuration files
- And/or IDE configuration settings
Details are system-specific, IDE-specific
- What we have shown will be good enough for us

NumPy, Pandas, SciPy, and Statsmodels

Powerful and popular data analysis modules
NumPy: ndarray n-dimensional arrays, related math functions, linear algebra,...
Pandas: Series indexed data series and DataFrame "spreadsheet"-like facilities
SciPy: efficient numerical routines: integration, cubic spline, optimization,...
Statsmodels: statistical models, tests, data exploration, using DataFrames
We will follow an import naming convention common in Python examples and documentation:

import numpy as np
import pandas as pd
import scipy as sp
import statsmodels as sm
import matplotlib.pyplot as plt

NumPy ndarray

A one-dimensional ndarray is "like" an optimized list for vector operations
- Data in contiguous memory
- Vectorized computation algorithms in C
Individual item access with [] is the same for ndarray as for list
An ndarray does not store Python int values
- Stores efficient but more restrictive C/C++ style fixed-width integers (32-bit or 64-bit, depending on the platform)
An ndarray is iterable, so conversion to basic collection types is easy
To create an ndarray from a numeric iterable, use np.array(iterable)
Unlike a list, every element of an ndarray will be of the same type
- Upcasting converts elements to the minimum type able to hold all objects

ls1 = list(range(5))
ls1
[0, 1, 2, 3, 4]

a1 = np.arange(5) # 'array-range'
a1
array([0, 1, 2, 3, 4])

a1[2]
2

a1[-1]
4

list(a1)
[0, 1, 2, 3, 4]

tuple(a1)
(0, 1, 2, 3, 4)

ndarray Arithmetic: Vector vs. List Operations

### list operations:
ls1 *=2  # concatenation
ls1
[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

ls1 += 1  # TypeError: not defined for a list and an int

ls1 + ls1
[0, 1, 2, 3, 4, ....., 0, 1, 2, 3, 4]

### ndarray vector operations:
a1 *= 2  # scalar multiplication
a1
array([0, 2, 4, 6, 8])

a1 += 1  # add 1 to each element
a1

ndarray Slices

A slice of an ndarray is a view on part of the ndarray

a2 = np.arange(10)
a2[2:6]
array([2, 3, 4, 5])
a2[2:6] = -8
a2
array([ 0,  1, -8, -8, -8, -8,  6,  7,  8,  9])

ndarray copy

You must use copy() to get an independent copy of an ndarray or ndarray slice

a4 = a2[:5].copy()
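
A sketch contrasting a slice view with an independent copy, reusing the names a2 and a4 from above. The name view is an assumption added for illustration.

```python
import numpy as np

a2 = np.arange(10)

view = a2[2:6]        # a view: shares memory with a2
view[:] = -8          # writing through the view changes a2 itself

a4 = a2[:5].copy()    # an independent copy of the first five items
a4[0] = 99            # changing the copy leaves a2 untouched

print(a2)
print(a4)
```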

N-dimensional ndarrays

One way to create an N-dimensional ndarray is from a list of lists, or a tuple of tuples

a1 = np.array([[1,2,3,4], [6,4,2,0]])

a1.ndim 
2 # 2-dimensional

a1.shape
(2, 4) # tuple: 2 rows, 4 columns

Or, reshape() an existing ndarray
- Returns a new array object with the requested shape (a view on the same data when possible): the shape of the original array is unchanged

a2 = np.arange(12)
a3 = a2.reshape(3, 4)

Or, call a function that creates an ndarray of some shape, where shape is a tuple

np.ones(shape) # ndarray of all 1.0s
np.zeros(shape) # ndarray of all 0.0s
np.full(shape, val) # ndarray of all val
np.eye(N) # N*N identity matrix
np.identity(N) # N*N identity matrix

2-dimensional ndarray Slices

a1[1:, :3] # rows from 1 on; columns up to but not including 3

a1[2] # 3rd row
a1[2].shape
a1[2].ndim # a 1-D array

a1[2][1] = 5 # same as a1[2,1]=5

a1[2, :] # all columns 

array([[0,2,4,6,8], [-3,-9,-15,7,9], [15,5,9,-2,-1]])
a1[2:]
array([[15,5,9,-2,-1]])
a1[2:].shape
(1,5)
a1[2:].ndim
2 # 2-D array

a1[1,2] # item in row 1, col 2
a1[1:2, 2] # rows 1 through <2, col 2

Boolean Indexes

You can use a list or ndarray of Booleans to select a view on a subset of an ndarray
Applying an equality, inequality, or relational operator to an ndarray yields a Boolean ndarray of the same shape
For indexing purposes Boolean values are treated as binary
- Use & for and, | for or, ^ for xor (one and only one is true), ~ for invert
- These operators have higher precedence than the equality and relational operators, so use (...)

brows =[True, False, True]
a1[brows]

a1[:, [True, False, False, True, False]]

a1[a1<-5] = 8

a1[(a1>5) & (a1<9)] -= 5

Integer (or "Fancy") Indexes

You can use a list or ndarray of integers to select a view on a subset of an ndarray
- Can change the order of rows or columns, or create duplicates
From 2-D ndarray, you can use two integer index lists to create a 1-D ndarray of selected values
- The two index lists must be the same length

a1[[2,0]] # row 2 followed by row 0
a1[:, [3,2,1,2]] # all rows, cols 3,2,1,2

a1[[0,1,0,2], [0,4,2,0]] #[0,0], [1,4], [0,2], [2,0]

More NumPy

NumPy offers many other facilities/methods
- Vectorized array methods: fabs, sqrt, exp, log, ceil, floor, sin, arcsin,...
- Statistical methods: mean, sum, cumsum, std, var, min, max,...
- Random number generators for many distributions
- Linear algebra calculations
- Sorting and set operations
- File input and output

Lecture 4: Intro to Pandas

Pandas: Series

A Series is a one-dimensional sequence of values, together with a same-length sequence of labels: the index of the Series
- By default, the index values are 0 through N-1
Index does not have to be int!

s1 = pd.Series([3,5,2,4])

s1
0  3
1  5
2  2
3  4
dtype: int64

s1[2]

s1.values
array([3,5,2,4], dtype=int64) # ndarray

s1.index
RangeIndex(start=0, stop=4, step=1) # like range

s2 = pd.Series([4,2,1,5], index=['a', 'x', 'C', '?'])

Series and dict

It's easy to construct a Series from a dict
- Index will be from keys, values from values

d1 = {'X': 123.2, 'AAPL': 543, 'CSCO': 47.7}
s3 = pd.Series(d1)  # dict: ticker, price
s3

X       123.2
AAPL    543.0
CSCO     47.7
dtype: float64

Series Index Check

You can use dict-style notation to check whether a Series does or does not have a specific index

'AAPL' in s3
True

Series Arithmetic, Slicing, and Indexing

s1 *=2  # times 2
s4 = s1 / 2  # float64
s1 //= 2  # int64
s1 + s1 # int64
np.sqrt(s1) # float64

If indexes differ between two Series, arithmetic will yield NaN (missing value or not applicable) for any label that is not in both
pd.isnull(Series) and pd.notnull(Series) yield Boolean indexes of NaN and non-NaN items, respectively

pd.isnull(s1p6)
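
A sketch of index alignment with two small Series whose indexes only partly overlap. The names s_a and s_b and their values are made up for illustration.

```python
import pandas as pd

s_a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s_b = pd.Series([10, 20], index=['b', 'd'])

total = s_a + s_b                    # aligned on the union of the two indexes
missing = pd.isnull(total)           # Boolean Series: True where the result is NaN
present = total[pd.notnull(total)]   # Boolean indexing keeps only the non-NaN items

print(total)
print(present)
```

Only 'b' appears in both indexes, so it is the only label with a numeric result. The result is float64 because NaN is a float.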

Pandas: DataFrame

A DataFrame is "like" a spreadsheet, with named columns and N rows
- Idea borrowed from R data.frame
- Easy to build from a dict, with column names as keys and equal-length lists as values in rows

f1 = pd.DataFrame(d2)

DataFrame Column Access

One way to retrieve a column, as a Series:

f1['CEO_buy']
s1.name = 'I am s1'

One way to retrieve a row, as a Series:

f1.loc[2] # the 3rd row
f1.loc[2][0] # the 3rd row, 1st column

Three ways to retrieve a cell:

f1['CEO_buy'][2]
f1.loc[2]['CEO_buy']
f1.loc[2, 'CEO_buy']

DataFrame: Add a Column

Add a column to a DataFrame much as you would add a key: value pair to a dict

f1['P/E'] = 20.5
f1['PEG_ratio'] = [17, 0.9, 2.2, 1.1]

DataFrame: Delete a Column

del f1['P/E']

DataFrame: Add a Row

If the data types of the new row don't match the existing column types, the column types will be upcast!

f1.loc[4] = ['z', True, 8.2, 1.3]

DataFrame: Delete a Row

Deleting the problem row will not "deupcast" - beware!

fic.drop(0.5) # specify the index to drop; drop() returns a copy and does not change fic

DataFrame Slicing and Indexing

Slicing and indexing work for DataFrames much as for Series

a12.reshape(4,3)

f2 = pd.DataFrame(a12.reshape(4,3), index=['r0', 'r1', 'r2', 'r3'], columns=['c0', 'c1', 'c2'])

f2['c0']
f2.loc['r1']

f2.loc['r0': 'r2'] # up to and including r2
f2[0:2] # up to but not including position 2

DataFrame loc and iloc

loc uses row and column labels
iloc uses row and column integer positions

f2.loc['r1']
f2.loc['r0': 'r1'] # r0 through and including r1
f2.loc[['r3', 'r1']] # r3 and r1

f2.iloc[2]
f2.iloc[1:3] # position 1 through but not including position 3
f2.iloc[[3,2]] # positions 3 and 2

f2.loc['r1', 'c2']
f2.loc['r0': 'r2', ['c2', 'c0']]
f2.iloc[[0, 3], [2, 0]]

Arithmetic with Fill Values

We have seen that arithmetic combining DataFrames with unlike columns/rows introduces missing values
DataFrame named arithmetic functions allow you to specify a fill_value argument
- The fill value is used in place of a missing value in one DataFrame or the other
- But not if the value is missing in both DataFrames!

f3 = f2.copy()
f3['c3'] = [1, 3, 5, 7]
f3 = f3.drop('r1')
f2 + f3

f2.add(f3, fill_value=0)
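
A sketch of the difference between + and add() with a fill value. The contents of f2 are an assumption (the integers 0 through 11), since a12 is not shown in these notes.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for a12.reshape(4, 3): the integers 0..11.
f2 = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=['r0', 'r1', 'r2', 'r3'],
                  columns=['c0', 'c1', 'c2'])
f3 = f2.copy()
f3['c3'] = [1, 3, 5, 7]   # f3 gains a column that f2 lacks
f3 = f3.drop('r1')        # and loses a row that f2 has

plain = f2 + f3                       # NaN wherever either side is missing
filled = f2.add(f3, fill_value=0)     # a missing side is treated as 0

print(plain)
print(filled)
```

With the fill value, row r1 keeps the values from f2 and column c3 keeps the values from f3. The cell at (r1, c3) is missing in both DataFrames, so it stays NaN.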

Sorting

To sort DataFrame rows, use sort_index()
- For columns, sort_index(axis=1)

f4 = pd.DataFrame(np.array(f2), index=['a', 'd', 'b', 'c'], columns=['z', 'x', 'y'])

f4.sort_index() # sort rows
f4.sort_index(axis=1) # sort cols

f4.sort_index().sort_index(axis=1) # sort both rows and cols

f4.sort_index(ascending=False).sort_index(axis=1) # rows descending, cols ascending

DataFrame Summary Statistics

DataFrame provides "the usual suspects" for summary statistics

np.random.seed(1)
a5 = np.random.randn(3,3).round(3)
f5 = pd.DataFrame(a5,index=['r0', 'r1', 'r2'], columns=['c0', 'c1', 'c2'])

f5.sum()
f5.mean()
f5.var()
f5.std()
f5.describe()

idxmin(), idxmax() # row label of min, max for each col
median()
prod() # product of row vals by col
skew(), kurt() # skewness, kurtosis
cumsum(), cumprod()
diff()
pct_change() # % change row over row

Lecture 5: Regular Expressions, Reading/Writing Formatted Data, String Methods

About Regular Expressions

Regular expressions are a compact, mathematical-style notation for matching patterns in text
- Originally developed in the context of human language parsing
- Then programming language grammars, parsers
Python provides an extensive collection of regular expression special characters and notations

Basic Regular Expression Characters

c  Matches the character c, unless c is a regular expression special character
^  Matches the start of a string