Data Focused Python

Lecture 1

Variable Names

A legal variable name consists of:

  • A letter or an underscore (_)
  • Followed by 0 or more letters and/or decimal digits (0-9) and/or underscores

Low-Level Scalar Types

  • int
  • float
  • str
  • bytes
  • bool
  • None

Arithmetic Operators

** # exponentiation
+ - # unary plus, minus
* / // % # multiply, divide, "floor" divide (truncate toward -infinity), modulus (remainder)
+ - # binary plus, minus
# use (...) for grouping
"hello"
'hello'
"hello" + 'hello'
"'hello'" + '"hello"'
"0123456789" * 3 # repetition
'''hello'''
"""hello""" # multiple line quotes
True
False

Arithmetic Operator Associativity - all binary arithmetic operators associate left-to-right, except **, which associates right-to-left

2 ** 3 ** 2 == 2 ** (3 ** 2) # both are 512
9 / 3 * 2 == (9 / 3) * 2 # both are 6.0
a = 2
b = 4
a **= b # a = a ** b
a -= 5 # a = a - 5
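
The associativity and floor-division rules above can be checked directly. This is a small sketch; the variable names are illustrative only and do not come from the lecture.

```python
# ** associates right-to-left; the other binary operators associate left-to-right.
power_chain = 2 ** 3 ** 2        # parsed as 2 ** (3 ** 2)
left_grouped = (2 ** 3) ** 2     # explicit left grouping gives a different value
division_chain = 9 / 3 * 2       # parsed as (9 / 3) * 2

# Floor division goes toward -infinity, so negative operands are worth checking.
floor_neg = -7 // 2
mod_neg = -7 % 2                 # the result takes the sign of the divisor

print(power_chain, left_grouped, division_chain, floor_neg, mod_neg)
# 512 64 6.0 -4 1
```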

Collection Built-In Types

list

  • Items are indexed from 0 to n-1
  • Or in reverse from -1 to -n

m = [7, 2 , 3 ,0]

Items of a list can be modified:

[7, 2, 3, 0]
m[2] = 5
[7, 2, 5, 0]

"+" and "*" work as with str objects:

m + [4, 7, 1]
[7, 2, 5, 0, 4, 7, 1]
[0] * 5
[0, 0, 0, 0, 0]

A list supports many named operations, including:

append(val) # "like" push_back: always adds at the end
insert(idx, val) # insert ahead of "idx"
remove(val) # first match of val
pop([idx]) # return item at idx, or at n-1 if no idx, and remove that item
count(val)
sort() # ascending by default
reverse()
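
A short walk through the named operations listed above, starting from the list m as it stands after the m[2] = 5 assignment. The intermediate values in the comments are what each call leaves behind; the names last, first, and zeros are illustrative.

```python
m = [7, 2, 5, 0]

m.append(4)          # m is now [7, 2, 5, 0, 4]
m.insert(1, 9)       # insert ahead of index 1: [7, 9, 2, 5, 0, 4]
m.remove(5)          # remove the first 5: [7, 9, 2, 0, 4]
last = m.pop()       # no index: removes and returns the last item (4)
first = m.pop(0)     # removes and returns the item at index 0 (7)
zeros = m.count(0)   # how many items equal 0
m.sort()             # ascending, in place: [0, 2, 9]
m.reverse()          # in place: [9, 2, 0]

print(m, last, first, zeros)
```

Note that sort() and reverse() modify the list in place and return None, so `m = m.sort()` would lose the list.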

Logical and Physical Lines

  • A statement must be contained on a single logical line
  • A physical line ends with NEWLINE
  • It is a syntax error to split a logical line across multiple physical lines
  • unless the physical line ends with \
  • or the code is within (), [], or {}
x = \
12

x = (
12
)

Multiple Statements on One Physical Line

  • Use ; between statements on a physical line
a = 4; b = 12.5; c = 'x'

Length: len

  • Use "len" to get the number of items in any collection (including a str)
len([3, 2, 4, 4])
4

len('hello')
5

Sequence Slices

  • "list" and "str" are sequence types
  • A slice of a sequence is obtained with seq[i:j:k] # from item i, to not including item j, with step size k
  • In seq[i:j:k], missing :k implies :1
  • missing i implies 0
  • missing j implies len(seq)
ages = [3, 12, 5, 33, 68]
ages[0:3:1]
[3, 12, 5]
ages[0:2]
[3, 12]
ages[:4]
[3, 12, 5, 33]
ages [1:]
[12, 5, 33, 68]
a = [0, 1, 2, 3, 4, 5, 6, 7]
b = a[4:] + a[1:4]
b
[4, 5, 6, 7, 1, 2, 3]

'hello'[:3] + 'p!'
'help!'

s1 = 'international'
s1[::2]
'itrainl'

't' + s1[:8:2] + s1[7:11]
'titration'

Object Identity: id

  • Almost anything in Python is an object
  • An int, a float, a bool, a str, None, a list, a tuple, a set, ...
  • A function, a class, an iterator,...
  • Each object is uniquely identified by its id
  • The id may or may not be the memory address, depending on the Python implementation
  • Collections can have equal items without having the same ids
id(n)

x = 7
id(x)
y = 7
id(y) # same id as x (CPython reuses one object for small ints such as 7)

m= [1,2,3]
id(m)
n= [1,2,3]
id(n) # different id from m

n = m
id(n) # now the same

Equality, Relational and Logical Operators

  • From high precedence to low precedence
    • Notice keywords rather than operator symbols
    • Not the same precedence as in Java, C, C++
  • Use (...) for grouping
  • Unlike C/C++/Java, equality and relational operators may be chained with the expected mathematical meaning, e.g.:
    • a > b > c means a > b and b > c
  • Evaluation here is left to right
    • If a > b is False, then b > c is not evaluated
== != < <= > >= is is not
not
and
or
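
To see the chained-comparison rule in action, the sketch below wraps each operand in a small logging helper. The helper noisy and the lists log_short and log_full are hypothetical names used only for this illustration.

```python
def noisy(value, log):
    """Record that value was evaluated, then return it (illustrative helper)."""
    log.append(value)
    return value

# 1 > 2 is False, so the third operand is never evaluated.
log_short = []
result_short = noisy(1, log_short) > noisy(2, log_short) > noisy(3, log_short)

# 3 > 2 is True, so evaluation continues with 2 > 1.
log_full = []
result_full = noisy(3, log_full) > noisy(2, log_full) > noisy(1, log_full)

print(result_short, log_short)   # False [1, 2]
print(result_full, log_full)     # True [3, 2, 1]
```

Each operand is evaluated at most once, even though the middle one takes part in two comparisons.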

Identity vs. Value Comparison

  • "is" and "is not" compare object ids
  • == and != compare object values
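
A minimal sketch of the difference, reusing the m and n lists from the id examples; alias is an extra name introduced here for illustration.

```python
m = [1, 2, 3]
n = [1, 2, 3]      # equal items, but a separate list object
alias = m          # a second name for the same object as m

same_value = (m == n)        # compares items
same_object = (m is n)       # compares ids
alias_is_m = (alias is m)

alias.append(4)              # modifies the one object that m and alias share

print(same_value, same_object, alias_is_m)   # True False True
print(m, n)                                  # [1, 2, 3, 4] [1, 2, 3]
```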

The if Decision, and Indentation

  • The Python 3 if decision is of the form
    if bool_expr:
        statement1
        ... # optional statements
    blank line # this concludes the if
    
  • Indentation of statement1 is required
    • Additional statements, if any, must be indented by exactly the same amount
  • Even indenting the first character in a top-level line is an error
  • Line and Indentation Rules: Good or bad?
  • Good:
    • Forces a common indentation scheme for all programmers
    • No curly braces to keep track of
  • Not so good:
    • Line continuation is clunky
    • In long code, hard to see both beginning of if and end of if on the same screen

Using the IDLE Editor

  • Click File / New File to create a new file
  • Later, click Run / Run Module to execute the code

The print Function

  • In the interactive shell, the value of a typed expression (other than an assignment) is automatically displayed
  • In code in a file, you can use the print function to display output
    • By default, displays values separated with spaces
a = 'hello'
print(a, 'Bob', 11/3, True)
hello Bob 3.6666666666666665 True

Keyword Arguments

  • Many functions provide keyword arguments
    • Of the form name=value
    • Must follow all of the positional arguments
a = 'hello'
print(a, 'Bob', 11/3, True, sep=',') # change the separator (keyword argument)
hello,Bob,3.6666666666666665,True

General if/elif/else Decisions

  • A decision must start with one if part
    • Optionally followed by zero or more elif parts
    • Optionally followed by one else part
if bool_expr:
    stmt
elif bool_expr:
    stmt
else:
    stmt
if a == 'hello':
    print(a, 'is equal to \'hello\'')
elif b < c:
    print(b, 'is less than', c)
else:
    print('None of the above!')

General while Loops

  • A while loop is straightforward
while bool_expr:
    stmt
i = 0
while i < 10:
    print(i, end=' ') # a space is displayed instead of a newline
    i += 1

Iterating with for Loops

  • A for loop steps through each item in an iterable, such as a sequence object
for var in iterable:
    stmt
for i in [1, 5, 9, -4, 12]:
    print(i ** 2)

for c in 'hello':
    print(c)

The range Function

  • The range function provides a useful iterable
for var in range(N): # 0, 1, 2, ..., N-1
    stmt

for var in range(M,N): # M, M+1, M+2, ..., N-1
    stmt

for var in range(M,N,S): # M, M+S, M+2S, ..., <N
    stmt
for i in range(5):
    print(i) # 0, 1, 2, 3, 4

for i in range(4,-1,-1):
    print(i) # 4, 3, 2, 1, 0

Modules

  • Python uses modules of code for extended capabilities
    • Modules, or module items, can be imported into your code
    • For example, the math module contains many common mathematical functions and values

The math Module

import math
math.exp(1)
import math as m
m.exp(1)
m.cos(0)
m.sqrt(2)
m.log(1234)
from math import exp
exp(1) # if you just need exp()
from math import * # very risky! everything
exp(sqrt(sin(pi))) # all math available???
pi
sin(pi)
sqrt(sin(pi))
exp(sqrt(sin(pi)))

Copying a Text File

  • Quick and dirty: we will do better than this
fin = open('/dir1/dir2/in.txt','rt', encoding ='utf-8') # input file, read text
fout = open('/dir1/dir2/out.txt','wt', encoding ='utf-8') # output file, created or overwritten
for line in fin: # line is a str
    fout.write(line) # write the line into the output file
fin.close() # less important to close input file
fout.close() # must close output file
# \ means another line
# / directory name separator
# 'rt' - read text
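
One common way to do better than the quick-and-dirty version is the with statement, which closes both files even if an error occurs partway through. The sketch below is an assumption about what "better" might look like, not the lecture's own follow-up; the function name copy_text and the temporary file names are made up for the demonstration.

```python
import os
import tempfile

def copy_text(src_path, dst_path):
    """Copy a UTF-8 text file line by line; return the number of lines copied."""
    count = 0
    # Both files are closed automatically when the with block ends.
    with open(src_path, 'rt', encoding='utf-8') as fin, \
         open(dst_path, 'wt', encoding='utf-8') as fout:
        for line in fin:
            fout.write(line)
            count += 1
    return count

# Demonstration with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.txt')
    dst = os.path.join(tmp, 'out.txt')
    with open(src, 'wt', encoding='utf-8') as f:
        f.write('first line\nsecond line\n')
    lines_copied = copy_text(src, dst)
    with open(dst, 'rt', encoding='utf-8') as f:
        copied_text = f.read()

print(lines_copied)
```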

Lecture 2: More Collections, Type Conversion, and Web Scraping

Collection Built-In Types

  • Python 3.7 provides
    list # "like" an array
    tuple # "like" a record
    set # just one of each value
    frozenset # (we won't care)
    
  • A str can also be used as a collection

tuple Example

  • A tuple is enclosed in (...)
    • Items are indexed from 0 to n-1
      • Use [idx] to access an item
    • Or in reverse from -1 to -n

n = ('a', 'golden', 'eagle', 'soars')
type(n)
n
n[2]
  • A tuple is immutable
    • tuple items cannot be modified
    • But a variable can be changed to refer to a different tuple

n = (n[0], 'bald', n[2], n[3])
n
  • To assign tuple items to separate variables
    • Called sequence unpacking
a,b,c,d = n
c
  • You can construct a tuple without parens
    • Called tuple packing
t = 1, 4, 'hi', 4.6
t

Multiple Assignment

  • Multiple assignment combines tuple packing and sequence unpacking
a, b, c, d = 'hi', 12.6, True, 9
  • To swap the value of two variables
a = -7
b = 3
b, a = a, b
print (a, b)

Empty and One-Item tuples

  • An empty tuple can be represented as ()
t = ()
type(t)
t
  • But parens are also used for grouping
    • (value) is simply value
t = (6)
t
  • The goofy (value,) creates a one-item tuple
t = (6,) # otherwise it's an int
t
len(t)

tuple Slices and Concatenation

  • Like a list or str, a tuple is a sequence, supporting slices
    • Another way to construct a tuple from an existing tuple:
n
n = n[:1] + ('harpy',) + n[2:] # n[:1]: up to but not including index 1; n[2:]: from index 2 on
n

set Example

  • A set is enclosed in {...}
    • Items are unsorted (although may appear sorted when displayed), with no duplicates
s = { 1, 6, 5, 9, 2, 1, 6, 3}
type(s)
s

set Named Operations

  • set provides many named operations for manipulating items, including
add(val) # add val to set
discard(val) # remove val from set, if val is a member
remove(val) # remove val, and fail if val is not a member
pop() # remove and return an arbitrary item
clear() #  remove all items
val in s # True if val is in s, otherwise False
s
s.add(7)
s.add(3) # 3 is already in set, therefore is not added
s.discard(7) # 7 is gone
s.discard(13) # 13 is not there
s.remove(2) # 2 is removed
s.remove(13) # fails as there's no 13
s.pop()
s.add(1)
s.add(3)
s.add(13)
s.add(6) # 6 is already there, so it is ignored

set vs set Operations

  • set provides "the usual suspects" of set vs. set operations (methods)
s1.difference(s2) # s1 - s2
s1.symmetric_difference(s2) # union of (s1 - s2) and (s2 - s1)
s1.intersection(s2) # items in both s1 and s2
s1.isdisjoint(s2) # True if s1 and s2 have no items in common
s1.issubset(s2) # True if s1 is a subset of s2
s1.issuperset(s2) # True if s2 is a subset of s1
s1.union(s2) # items in s1, s2, or both
s1 - s2 # same as s1.difference(s2)

set Symbolic Operations

  • set provides some symbolic (borrowed from C/C++) operations that can be used rather than named operations
- # set difference
& # "and" (intersection)
^ # "xor" (symmetric difference)
| # "or" (union)
# from highest to lowest precedence
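
A brief sketch of the four symbolic operators and of why the precedence order matters. The sets s1 and s2 are sample values chosen for illustration.

```python
s1 = {1, 2, 3, 4}
s2 = {3, 4, 5}

diff = s1 - s2        # items in s1 but not in s2
both = s1 & s2        # intersection
one_only = s1 ^ s2    # symmetric difference
either = s1 | s2      # union

# - binds tighter than |, so the first line groups as (s1 - s2) | s2.
default_grouping = s1 - s2 | s2
explicit_grouping = s1 - (s2 | s2)

print(diff, both, one_only, either)
print(default_grouping, explicit_grouping)
```

When in doubt, parentheses make the intended grouping obvious to the reader as well as to Python.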

Remember: All Generic, All The Time

  • A set can contain any hashable items
    • Scalar: int, float, bool, str, None
    • tuples if all items are hashable
    • Not lists or sets, since these are mutable
s3 = {5, 4.7, None, 'hello', True, (2,9,14)}
s3

The dict Collection Type

  • Python 3.7 also provides
dict # dictionary
  • A dict consists of key: value pairs, enclosed in {...}
    • Keys must be hashable
    • No restrictions on values
n2e = {'john': 'jkstlund@gmail.com',
        'al': 'al@alcorp.net',
        'bob': 'bob@bassoc.com'}
type(n2e)
n2e['cy'] = 'cy@nou.edu' # add one more item
# store in the order of key creation
n2e['john'] = 'jkstlund@andrew.cmu.edu' # the value is changed

dict Operations

  • dict provides many named operations for manipulating items, including
get(key[, def]) # return value for key, or def if key does not exist, or None if def is not provided
popitem() # remove and return some (key, value) pair as a tuple
pop(key[, def]) # remove key and return its value; if key is not found, return def, or fail if def is not provided
clear() # remove all items
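
The following sketch exercises get, pop, and popitem, including the "not found" cases. The dict contents are placeholder addresses invented for this example, not the ones used elsewhere in these notes.

```python
n2e = {'john': 'john@example.com',
       'al': 'al@example.com',
       'cy': 'cy@example.com'}

missing_default = n2e.get('zed')             # no such key, no def: None
missing_custom = n2e.get('zed', 'unknown')   # no such key: def is returned

removed = n2e.pop('al')                      # returns the value and removes the pair
fallback = n2e.pop('al', 'gone')             # key already removed: def is returned

try:
    n2e.pop('al')                            # no key and no def: KeyError
    pop_failed = False
except KeyError:
    pop_failed = True

last_pair = n2e.popitem()                    # in Python 3.7+, the most recently added pair

print(removed, fallback, pop_failed, last_pair, n2e)
```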

dict Iterables

  • dict provides three iterables that make it easy to loop through keys, values, or items
keys() # iterable over all dict keys
values() # iterable over all dict values
items() # iterable over all dict (key,values) tuples
n2e.keys()
n2e.values()
n2e.items()
for k in n2e.keys():
    print(k)
n2e['dave'] = 'dave@dave.org'
n2e
for k in n2e.keys():
    print(k)
for i in n2e.items():
    print(i)

Empty and One-Item sets and dicts

  • Both set and dict have items enclosed in {}
de = {} # empty dict
type(de)
de
se = set() # empty set
type(se)
se

d1 = {key: value} # one-item dict
d1
s1 = {value} # one-item set
s1
  • set and dict are not sequences
    • No slicing with [m:n]

Conversions Among Low-Level Built-In Types

  • Virtually all objects can be converted to string via str
  • Conversions among int, float, and bool work "within reason"
str(None)
str(12.354e8)
str(n2e)
int(4.567)
4
int(-4.567) # truncate, not floor!
-4
int('456')
456
int(True)
1
float(5)
5.0
float('5.432')
5.432
bool(4321)
True
bool(0.0)
False
bool('')
False
bool(' ')
True
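
The "truncate, not floor" remark and the "within reason" caveat can both be checked with a few lines. This is a sketch; the variable names are illustrative.

```python
import math

truncated = int(-4.567)          # int() drops the fractional part (toward zero)
floored = math.floor(-4.567)     # floor goes toward -infinity
floor_div = -4.567 // 1          # floor division agrees with math.floor, as a float

# int() accepts integer text only; text with a decimal point must go through float().
try:
    int('4.567')
    int_from_float_text = True
except ValueError:
    int_from_float_text = False
via_float = int(float('4.567'))

print(truncated, floored, floor_div, int_from_float_text, via_float)
# -4 -5 -5.0 False 4
```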

A Convenient Web Scraping Module: BeautifulSoup

  • To "scrape" a website
    • Open a connection to the web page
    • Download the source (text) of the web page
    • Convert the source to an HTML-aware BeautifulSoup object
    • Write the BeautifulSoup to a file, and examine the contents for HTML tags: ...
    • Extract the tagged information you want from the BeautifulSoup

Scraping the Yield Curve

from urllib.request import urlopen # b_soup1.py
from bs4 import BeautifulSoup
html = urlopen('')
bsyc = BeautifulSoup(html.read(), "lxml")
fout = open('bsyc_temp.txt', 'wt', encoding='utf-8')
fout.write(str(bsyc))
fout.close()
  • Searching bsyc_temp.txt for 08/01/19, we are lucky to only find it once
<td class="text_view_data"
scope="row">08/01/19</td>
  • The tag is td
    • Googling HTML td tells us this is a cell in a table
    • Searching backward for , we find
      # print the first table
      print(str(bsyc.table))
      # ... not the one we want
      
      # so get a list of all table tags
      table_list = bsyc.findAll('table')
      
      # how many are there?
      print('there are', len(table_list), 'table tags')
      
      # look at the first 50 chars of each table
      for t in table_list:
          print(str(t)[:50])
      
      # only one class="t-chart" table, so add that to findAll as a dictionary attribute
      tc_table_list = bsyc.findAll('table', {"class": "t-chart"})
      
      # how many are there?
      print(len(tc_table_list), 't-chart table')
      
      # only 1 t-chart table, so grab it
      tc_table = tc_table_list[0]
      
      # what are this table's components/children?
      for c in tc_table.children:
          print(str(c)[:50])
      
      # tag tr means table row, containing table data
      # what are the children of those rows?
      for c in tc_table.children:
          for r in c.children:
              print(str(r)[:50])
      
      # we have found the table data!
      # just get the contents of each cell
      for c in tc_table.children:
          for r in c.children:
              print(r.contents)
      

      Lecture 3: Construction and Comprehension, Exceptions, User Input, Functions, Modules, and Intro to Numpy

      list, tuple, and set Construction

      • list, tuple, and set objects can be constructed from iterables
      tup1 = tuple('this is a test')
      tup1
      ('t', 'h',....)
      
      s1 = set(tup1)
      s1
      {'t', 'h',....}
      
      ls1 = list(s1)
      ls1.sort()
      ls1
      [' ', 'a', 'e', 'h', 'i', 's', 't']
      

      dict Construction

      • A dict object can be constructed from an iterable of 2-tuples
      set_of_2tups = {('a', 12), ('b', 22)}
      set_of_2tups
      {('b', 22), ('a', 12)}
      d1 = dict(set_of_2tups)
      d1
      {'b': 22, 'a': 12}
      

      The zip() Function

      • The zip() function zips two (or more) iterables together into an iterable of tuples
        • Handy for creating a dict from two iterables
      ls1
      [' ', 'a', 'e', 'h', 'i', 's', 't']
      d2 = dict(zip(ls1, range(len(ls1))))
      {' ': 0, 'a': 1, 'e': 2,..., 't': 6}
      

      Comprehensions

      • A comprehension is a concise way of building a list, tuple, set, or dict
        • It is "Pythonic", meaning cool
        • Comprehensions can be clear... or very obscure

      A list Comprehensions

      • The brute force way to create a list of int values from 0 through 15:
      m20 = [0, 1, 2, 3, 4, 5, 6,..., 15]
      

      A list From a for Loop

      • An easier and less error-prone way
      m20 =[] # start empty
      for v in range(16):
          m20.append(v)
      

      A list Comprehension

      • A list comprehension puts the for loop inside the list
      m20 = [v for v in range(16)] # 0 through 15
      m21 = [0 for v in range(8)]
      m22 = [v**2 for v in range(8)]
      m23 = [v/2 for v in range(10) if v % 2 == 1]
      import math as m
      m24 = [m.cos(m.pi * v/4) for v in range(8)]
      m25 = [(v**(1/3),v**.5, v, v**2, v**3) for v in range(9)] # a list of tuples
      
      [expr for var(s) in iter [for_or_if...]] # in general
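
The general form allows more than one for clause, plus if clauses. As a sketch, here is a comprehension with two for clauses and a filter, next to the explicit nested loops it stands for; pairs and pairs_loop are illustrative names.

```python
# Comprehension with two for clauses and one if clause.
pairs = [(r, c) for r in range(3) for c in range(3) if r < c]

# The same list built with explicit nested loops, in the same clause order.
pairs_loop = []
for r in range(3):
    for c in range(3):
        if r < c:
            pairs_loop.append((r, c))

print(pairs)   # [(0, 1), (0, 2), (1, 2)]
```

The clauses read left to right in the same order as the nested loops read top to bottom, which helps keep longer comprehensions on the "clear" side of "clear... or very obscure".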
      

      set Comprehension

      • A set comprehension is like a list comprehension
        • Use {} rather than []
      s20 = {v**2 for v in range(8)}
      

      dict Comprehension

      • A dict comprehension can also use {}
        • Items must be specified with key: value notation
      d20 = {k: k**2 for k in range(8)}
      

      Exception

      • Many program errors raise exceptions
      7 / 0 # ZeroDivisionError
      d = dict(5) # TypeError
      x = float('12.34.56') # ValueError
      fin = open('/foo/asdf', 'rt') # FileNotFoundError
      

      try... except

      • You can try a block of statements
        • If an exception occurs, use except to capture and handle the exception
          • except in this form handles any kind of exception
      try:
          val = 7/0
      except: # handle any exception
          val = -1.0
      val
      -1.0
      
      except FileNotFoundError: # a specific kind of error
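
Naming the exception type lets each kind of failure get its own handling, and lets unexpected errors surface instead of being silently swallowed by a bare except. A minimal sketch, using a hypothetical helper parse_ratio that is not part of the lecture code:

```python
def parse_ratio(num_text, den_text):
    """Return (num/den, 'ok'), or (None, reason) when the ratio cannot be computed."""
    try:
        return float(num_text) / float(den_text), 'ok'
    except ValueError:            # text that float() cannot parse
        return None, 'bad number format'
    except ZeroDivisionError:     # denominator parsed as zero
        return None, 'division by zero'

good = parse_ratio('7', '2')
bad_format = parse_ratio('seven', '2')
zero_den = parse_ratio('7', '0')

print(good, bad_format, zero_den)
```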
      

      Handling User Input

      • Users make all kinds of errors
        • Use input(prompt) to read user input as a string
        • Convert to desired type: int, float,...
        • Use try...except to deal with formatting errors
        • Use normal logic to deal with range errors
      answer = input('Please enter your name: ')
      age = input ('Please enter your age: ')
      age_val = float(age)
      
      age_bad = True # user_age.py
      age = 0.0
      while age_bad:
          try:
              age_str = input("Enter your age: ")
              age = float(age_str)
          except:
              print("Bad age format")
              age = -1.0
          if not 0.0 < age <= 125.0:
                  print("Enter value in (0.0, 125.0]")
          else:
              age_bad = False
      print("Age is", age)
      

      Defining and Calling Functions

      • A function definition has this form, in which
        • p1, p2,..., are optional positional parameters
        • n1=v1,n2=v2,... are optional so-called keyword parameters and their corresponding default values
      def say_hi():
          print("hi")
      
      def func_name(p1,p2,..., n1=v1, n2=v2,...):
          stmt
      
      def ret_pow(x, y=2): # positional argument vs. keyword argument
          return x ** y
      

      Variadic Functions

      • A variadic function is a function that can be called with a varying number of arguments.
        • A function can be defined to receive a varying number of positional arguments via the notation *args after the required positional arguments
        • Actually, any identifier can be used: args is conventional

      Variadic Positional Arguments

      • Within the body, args is a tuple of all trailing argument values
      def very_fun (a, b, c, *args):
          print (a,b,c, args) # args is an empty tuple if there are no extra arguments
      
      very_fun (1, 2, 3, 4, 5, 6, 7, 8, 9) # 4-9 are collected into the args tuple
      
      def mysum(*args):
          print(args)
          total = 0 # avoid shadowing the built-in sum
          for v in args:
              total += v
          return total
      
      mysum(1,2,3,4,5,6)
      

      Variadic Keyword (Named) Arguments

      • A function can be defined to receive a varying number of keyword (named) arguments via the notation **kwargs
        • The name **kwargs is conventional
        • **kwargs must come after positional and other keyword arguments, if any
        • Within the body, kwargs is a dict of all trailing keyword arguments and values
      def vf2 (a, *args, b=42, **kwargs): # kwargs is an empty dict if there are no extra keyword arguments
          print(a, args, b, kwargs)
      
      vf2 (1,2,3,4,b=5,arg=6,ht=70,nm='Joe')
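
To make the binding rules concrete, the sketch below uses the same signature as vf2 but returns what each parameter received instead of printing it. The function name describe_call and the sample values are illustrative.

```python
def describe_call(a, *args, b=42, **kwargs):
    """Return how the arguments were bound (same parameter layout as vf2)."""
    return {'a': a, 'args': args, 'b': b, 'kwargs': kwargs}

# Same call as the vf2 example: 2, 3, 4 go to args; arg, ht, nm go to kwargs.
full = describe_call(1, 2, 3, 4, b=5, arg=6, ht=70, nm='Joe')

# With nothing extra, args is an empty tuple and kwargs an empty dict.
minimal = describe_call(1)

# The reverse direction: * and ** unpack a sequence and a dict at the call site.
values = (1, 2, 3)
options = {'b': 9, 'nm': 'Ann'}
unpacked = describe_call(*values, **options)

print(full)
print(minimal)
print(unpacked)
```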
      

      About Modules

      • A module file contains Python code (file.py)
        • A package is a hierarchically structured collection of related modules - beyond our scope
      • When you run a module in IDLE or some other Python IDE, that is the main module
        • It may import and use all or parts of other modules
      • An interactive Python shell considers itself the main module
      # this is my test module, testmod.py
      import math as m  # import other modules
      print('I am the testmod.py module')
      print('The value of pi is (approximately):', m.pi)
      

      Module Contents

      • A module may define
        • Variables (like pi or e)
        • Functions (like sqrt or cos)
        • Classes (like BinaryTree)
      • The name of the module is simply the name of the code file, with the .py removed
        • mystuff.py contains the mystuff module

      The __name__ Variable

      • Within any module, the variable __name__ is set to the name of the module
        • The name of the main module is '__main__'
        • In the interactive shell:
      __name__
      vars()
      

      Module Test Code

      • For development and testing, code that "just runs" can be placed near the end of the module, like so:
      # mymodule.py
      def fun1():
          ...
      c_num = '95888'
      if __name__ == '__main__':
          fun1()      # test call of fun1
          print(c_num)    # display c_num value
      
      • If mymodule.py is run from within IDLE or another IDE, the code following if __name__ == '__main__': will be executed
      • But if some other module does import mymodule, the test code will not be executed, because __name__ == 'mymodule'

      Example: mymath.py

      # mymath.py
      def sqrt(x):
          return x ** .5
      def cube(x):
          return x ** 3
      def mysum(*args):
          x = 0
          for v in args:
              x += v
          return x
      
      if __name__ == '__main__':
          print('module name is: ', __name__)
          print('square root of 3: ', sqrt(3))
          print('sum of 1, 2, 4, 8, 6, 9 is: ',
                mysum(1, 2, 4, 8, 6, 9))
      else:
          print('imported module name is: ', __name__)
      
      # myprog.py
      import mymath as mm
      x = 123
      if __name__ == '__main__':
          print('module name is: ', __name__)
          print(x, 'cubed is', mm.cube(x))
          print('square root of', x, 'is', mm.sqrt(x))
      

      Good Enough For Us

      • You can be much more sophisticated than this in organizing your code, via:
        • Environment variable settings
        • Configuration files
        • And/or IDE configuration settings
      • Details are system-specific, IDE-specific
        • What we have shown will be good enough for us

      NumPy, Pandas, SciPy, and Statsmodels

      • Powerful and popular data analysis modules
      • NumPy: ndarray n-dimensional arrays, related math functions, linear algebra,...
      • Pandas: Series (indexed data series) and DataFrame ("spreadsheet"-like facilities)
      • SciPy: efficient numerical routines: integration, cubic spline, optimization,...
      • Statsmodels: statistical models, tests, data exploration, using DataFrames
      • We will follow an import naming convention common in Python examples and documentation:
      import numpy as np
      import pandas as pd
      import scipy as sp
      import statsmodels as sm
      import matplotlib.pyplot as plt
      

      NumPy ndarray

      • A one-dimensional ndarray is "like" an optimized list for vector operations
        • Data in contiguous memory
        • Vectorized computation algorithms in C
      • Individual item access with [] is the same for ndarray as for list
      • An ndarray does not store Python int values
        • Stores efficient but more restrictive C/C++ style fixed-width integers (32- or 64-bit, depending on platform)
      • An ndarray is iterable, so conversion to basic collection types is easy
      • To create an ndarray from a numeric iterable, use np.array(iterable)
      • Unlike a list, every element of an ndarray will be of the same type
        • Upcasting converts elements to the minimum type able to hold all objects
      ls1 = list(range(5))
      ls1
      [0, 1, 2, 3, 4]
      
      a1 = np.arange(5) # 'array-range'
      a1
      array([0, 1, 2, 3, 4])
      
      a1[2]
      2
      
      a1[-1]
      4
      
      list(a1)
      [0, 1, 2, 3, 4]
      
      tuple(a1)
      (0, 1, 2, 3, 4)
      

      ndarray Arithmetic: Vector vs. List Operations

      ### list operations:
      ls1 *= 2  # repetition: the list concatenated with itself
      ls1
      [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
      
      ls1 += 1  # TypeError: an int is not iterable
      
      ls1 + ls1
      [0, 1, 2, 3, 4, ....., 0, 1, 2, 3, 4]
      
      ### ndarray vector operations:
      a1 *= 2  # scalar multiplication
      a1
      array([0, 2, 4, 6, 8])
      
      a1 += 1  # add 1 to each element
      a1
      

      ndarray Slices

      • A slice of an ndarray is a view on part of the ndarray
      a2 = np.arange(10)
      a2[2:6]
      array([2,3,4,5])
      a2[2:6] = -8
      a2
      array([0,1,-8,-8,-8,-8,6,7,8,9])
      

      ndarray copy

      • Must use copy() to get an independent copy of an ndarray or ndarray slice
      a4 = a2[:5].copy()
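
The difference between a slice (a view) and a copy() shows up as soon as either one is modified. A minimal sketch, assuming NumPy is installed; view and independent are illustrative names.

```python
import numpy as np

a2 = np.arange(10)
view = a2[2:6]                 # a view: shares memory with a2
independent = a2[2:6].copy()   # an independent copy of the same items

view[:] = -8                   # writes through to a2
independent[:] = 99            # a2 is unaffected

shares = np.shares_memory(a2, view)
separate = not np.shares_memory(a2, independent)

print(a2)            # [ 0  1 -8 -8 -8 -8  6  7  8  9]
print(independent)   # [99 99 99 99]
print(shares, separate)
```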
      

      N-dimensional ndarrays

      • One way to create an N-dimensional ndarray is from a list of lists, or a tuple of tuples

      a1 = np.array([[1,2,3,4], [6,4,2,0]])
      
      a1.ndim 
      2 # 2-dimensional
      
      a1.shape
      (2, 4) # tuple: 2 rows, 4 columns
      
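      • Or, reshape() an existing ndarray
        • Returns a new array object with the requested shape: the original array's shape is unchanged
        • The result shares data with the original when possible (a view), so use copy() for independent data

      a2 = np.arange(12)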
      a3 = a2.reshape(3, 4)
      
      • Or, call a function that creates an ndarray of some shape, where shape is tuple
      np.ones(shape) # ndarray of all 1.0s
      np.zeros(shape) # ndarray of all 0.0s
      np.full(shape, val) # ndarray of all val
      np.eye(N) # N*N identity matrix
      np.identity(N) # N*N identity matrix
      

      2-dimensional ndarray Slices

      a1[1:, :3] # rows from 1 on; columns up to but not including 3
      
      a1[2] # 3rd row
      a1[2].shape
      a1[2].ndim # a 1-D array
      
      a1[2][1] = 5 # same as a1[2,1]=5
      
      a1[2, :] # all columns 
      
      array([[0,2,4,6,8], [-3,-9,-15,7,9], [15,5,9,-2,-1]])
      a1[2:]
      array([[15,5,9,-2,-1]])
      a1[2:].shape
      (1,5)
      a1[2:].ndim
      2 # 2-D array
      
      a1[1,2] # item in row 1, col 2
      a1[1:2, 2] # rows 1 through <2, col 2
      

      Boolean Indexes

      • You can use a list or ndarray of Booleans to select a view on a subset of an ndarray
      • Applying an equality, inequality, or relational operator to an ndarray yields a Boolean ndarray of the same shape
      • For indexing purposes Boolean values are treated as binary
        • Use & for and, | for or, ^ for xor (one and only one is true), ~ for invert
        • These operators have higher precedence than the equality and relational operators, so use (...)
      brows =[True, False, True]
      a1[brows]
      
      a1[:, [True, False, False, True, False]]
      
      a1[a1<-5] = 8
      
      a1[(a1>5) & (a1<9)] -= 5
      

      Integer (or "Fancy") Indexes

      • You can use a list or ndarray of integers to select a view on a subset of an ndarray
        • Can change the order of rows or columns, or create duplicates
      • From 2-D ndarray, you can use two integer index lists to create a 1-D ndarray of selected values
        • The two index lists must be the same length
      a1[[2,0]] # row 2 followed by row 0
      a1[:, [3,2,1,2]] # all rows, cols 3,2,1,2
      
      a1[[0,1,0,2], [0,4,2,0]] #[0,0], [1,4], [0,2], [2,0]
      

      More NumPy

      • NumPy offers many other facilities/methods
        • Vectorized array methods: fabs, sqrt, exp, log, ceil, floor, sin, arcsin,...
        • Statistical methods: mean, sum, cumsum, std, var, min, max,...
        • Random number generators for many distributions
        • Linear algebra calculations
        • Sorting and set operations
        • File input and output

      Lecture 4: Intro to Pandas

      Pandas: Series

      • A Series is a one-dimensional sequence of values, together with a same-length sequence of labels: the index of the Series
        • By default, the index values are 0 through N-1
      • Index does not have to be int!
      s1 = pd.Series([3,5,2,4])
      
      s1
      0  3
      1  5
      2  2
      3  4
      dtype: int64
      
      s1[2]
      
      s1.values
      array([3,5,2,4], dtype=int64) # ndarray
      
      s1.index
      RangeIndex(start=0, stop=4, step=1) # like range
      
      s2=pd.Series([4,2,1,5], index=['a', 'x', 'C', '?'])
      

      Series and dict

      • It's easy to construct a Series from a dict
        • Index will be from keys, values from values
      d1 = {'X': 123.2, 'AAPL': 543, 'CSCO': 47.7}
      s3 = pd.Series(d1)  # dict: ticker, price
      s3
      
      X       123.2
      AAPL    543.0
      CSCO     47.7
      dtype: float64
      

      Series Index Check

      • You can use dict-style notation to check whether a Series does or does not have a specific index
      'AAPL' in s3
      True
      

      Series Arithmetic, Slicing, and Indexing

      s1 *=2  # times 2
      s4 = s1 / 2  # float64
      s1 //= 2  # int64
      s1 + s1 # int64
      np.sqrt(s1) # float64
      
      • If indexes differ between two Series, arithmetic will yield NaN (missing value or not applicable) for the labels that are not in both
      • pd.isnull(Series) and pd.notnull(Series) yield Boolean indexes of NaN and non-NaN items, respectively
      pd.isnull(s1p6)
      

      Pandas: DataFrame

      • A DataFrame is "like" a spreadsheet, with named columns and N rows
        • Idea borrowed from R data.frame
        • Easy to build from a dict, with column names as keys and equal-length lists as values in rows
      f1 = pd.DataFrame(d2)
      

      DataFrame Column Access

      • One way to retrieve a column, as a Series:
      f1['CEO_buy']
      s1.name = 'I am s1'
      
      • One way to retrieve a row, as a Series:
      f1.loc[2] # the 3rd row
      f1.loc[2][0] # the 3rd row, 1st column
      
      • Three ways to retrieve a cell:
      f1['CEO_buy'][2]
      f1.loc[2]['CEO_buy']
      f1.loc[2, 'CEO_buy']
      

      DataFrame: Add a Column

      • Add a column to a DataFrame much as you would add a key: value pair to a dict
      f1['P/E'] = 20.5
      f1['PEG_ratio'] = [17, 0.9, 2.2, 1.1]
      

      DataFrame: Delete a Column

      del f1['P/E']
      

      DataFrame: Add a Row

      • If the data types of the new row don't match the existing column types, the column types will be upcast!
      f1.loc[4] = ['z', True, 8.2, 1.3]
      

      DataFrame: Delete a Row

      • Deleting the problem row will not "deupcast" - beware!
      fic.drop(0.5) # specify the index to drop; drop() returns a copy and doesn't change fic
      

      DataFrame Slicing and Indexing

      • Slicing and indexing works for DataFrames much as for Series
      a12.reshape(4,3)
      
      f2 = pd.DataFrame(a12.reshape(4,3), index = ['r0', 'r1', 'r2', 'r3'], columns = ['c0', 'c1', 'c2'])
      
      f2['c0']
      f2.loc['r1']
      
      f2.loc['r0': 'r2'] # up to and include r2
      f2[0:2] # up to and not include index 2
      

      DataFrame loc and iloc

      • loc uses row and column labels
      • iloc uses row and column integer positions
      f2.loc['r1']
      f2.loc['r0': 'r1'] # r0 through and including r1
      f2.loc[['r3', 'r1']] # r3 and r1
      
      f2.iloc[2]
      f2.iloc[1:3] # sub-1 through but not including sub-3
      f2.iloc[[3,2]] # sub-3 and sub-2
      
      f2.loc['r1', 'c2']
      f2.loc['r0': 'r2', ['c2', 'c0']]
      f2.iloc[[0, 3], [2, 0]]
      

      Arithmetic with Fill Values

      • We have seen that arithmetic combining DataFrames with unlike columns/rows introduces missing values
      • DataFrame named arithmetic functions allow you to specify a fill_value argument
        • The fill value is used in place of a missing value in one DataFrame or the other
        • But not if the value is missing in both DataFrames!
      f3 = f2.copy()
      f3['c3'] = [1, 3, 5, 7]
      f3 = f3.drop('r1')
      f2 + f3
      
      f2.add(f3, fill_value=0)
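
A small, self-contained illustration of both cases: a value missing in one frame (filled) and a value missing in both (still NaN). It is a sketch that assumes pandas is installed; the frames fa and fb are made up for the example and are much smaller than f2 and f3.

```python
import pandas as pd

fa = pd.DataFrame({'c0': [1.0, 2.0]}, index=['r0', 'r1'])
fb = pd.DataFrame({'c0': [10.0], 'c1': [20.0]}, index=['r0'])

# Plain + aligns on row and column labels; any cell missing on either side is NaN.
plain = fa + fb

# add() with fill_value=0 substitutes 0 where exactly one side is missing.
filled = fa.add(fb, fill_value=0)

print(plain)
print(filled)
```

In filled, cell (r1, c0) is 2.0 and cell (r0, c1) is 20.0 because only one frame lacked them, while cell (r1, c1) stays NaN because neither frame has it.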
      

      Sorting

      • To sort DataFrame rows, use sort_index()
        • For columns, sort_index(axis=1)
      f4 = pd.DataFrame(np.array(f2), index=['a', 'd', 'b', 'c'], columns=['z', 'x', 'y'])
      
      f4.sort_index() # sort rows
      f4.sort_index(axis=1) # sort cols
      
      f4.sort_index().sort_index(axis=1) # sort both rows and cols
      
      f4.sort_index(ascending=False).sort_index(axis=1) # rows descending, cols ascending
      

      DataFrame Summary Statistics

      • DataFrame provides "the usual suspects" for summary statistics
      np.random.seed(1)
      a5 = np.random.randn(3,3).round(3)
      f5 = pd.DataFrame(a5,index=['r0', 'r1', 'r2'], columns=['c0', 'c1', 'c2'])
      
      f5.sum()
      f5.mean()
      f5.var()
      f5.std()
      f5.describe()
      
      idxmin(), idxmax() # row of min, max for each col
      median()
      prod() # product of row vals by col
      skew(), kurt() # skew, kurtosis
      cumsum(), cumprod()
      diff()
      pct_change() # % change row over row
      

      Lecture 5: Regular Expressions, Reading/Writing Formatted Data, String Methods

      About Regular Expressions

      • Regular expressions are a compact, mathematical-style notation for matching patterns in text
        • Originally developed in the context of human language parsing
        • Then programming language grammars, parsers
      • Python provides an extensive collection of regular expression special characters and notations

      Basic Regular Expression Characters

      c    Matches the character c, unless c is a regular expression special character
      ^    Matches the start of a string