
Machine Learning Introduction Series by Women Who Code- Notes

I am attending the ML series run by Women Who Code, a six-week program in which each week covers a few topics in ML. I will post my notes and assignments for each week on this blog.

ML Intro Series 1:
Series 1 was divided into two parts, ML basics and hypothesis testing, and also included an introductory lab on how to use Colab.

ML Basics/Intro

What is Machine Learning?
  • a subset of AI 
  • class of computer algorithms that learns from data 
  • algorithms that improve with experience
  • data and outputs are provided, and learning produces a function that maps input to output, which can then be reused in multiple scenarios
Why ML now?
  • large computing power
  • big Data available
  • technologies that deal with data available
  • high storage capacity
  • higher RAM available
  • reduction in the gap between academia and industry
Terms
  • data
  • features
  • target variable
Types of ML
  • supervised
  • unsupervised
  • semi-supervised
  • reinforcement learning
Supervised Learning 
  • known: inputs and outputs (the training examples)
  • unknown: the function that maps input to output
  • goal: find that function
Types of Supervised Learning
  • regression when target is continuous
  • classification when target is categorical
We basically need supervised learning when there is no human expert for the task, when humans can't describe how they do the task, when the function changes frequently, or when we need a personalised function for each use case.
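
To make "finding the function" concrete, here is a minimal sketch of regression (continuous target) that fits a straight line to a few training examples with ordinary least squares. The data, the feature/target meaning and the function names are made up for illustration and are not from the session.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels out)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training examples: feature = hours studied, target = exam score
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]

slope, intercept = fit_line(hours, scores)


def predict_score(x):
    """The learned function that maps an input to a predicted output."""
    return slope * x + intercept


print(slope, intercept, predict_score(6.0))
```

The known part is the list of (hours, score) pairs; the unknown part is the line, which the code recovers and then applies to an input it has not seen.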

Hypothesis Testing

  • A hypothesis test calculates some quantity under a given assumption.
  • The value of that quantity tells us whether the assumption holds true or is violated.

The normal distribution is a type of population distribution that is very commonly found in natural phenomena.

Conducting a hypothesis test
A test starts with the assumption that the null hypothesis (H0), also called the default hypothesis, holds true. A violation of this assumption is the alternate hypothesis, also called the first hypothesis (H1).

P-Value
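  • It is a quantity used to interpret the result of a hypothesis test: the probability, assuming the null hypothesis is true, of a result at least as extreme as the one observed.
  • Many different test statistics can be used to calculate a p-value.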

Alpha
  • It is the significance level: the threshold the p-value is compared against to decide whether to reject the null hypothesis.
  • It is generally 5% (0.05). The lower the alpha, the higher the confidence.
  • Confidence is 1 minus alpha.
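
As a minimal sketch of how the p-value and alpha fit together, the example below runs a one-sample, two-tailed z-test using only the Python standard library. The numbers (sample mean, population mean, standard deviation, sample size) and the function name are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(sample_mean, pop_mean, pop_std, n, alpha=0.05):
    """Return (z, p_value, reject_null) for a one-sample two-tailed z-test."""
    # z-score: how many standard errors the sample mean is from the null mean
    z = (sample_mean - pop_mean) / (pop_std / sqrt(n))
    # Two-tailed p-value: probability of a |z| at least this large under the null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Hypothetical numbers: null hypothesis says the population mean is 100
z, p_value, reject_null = two_tailed_z_test(
    sample_mean=103.0, pop_mean=100.0, pop_std=15.0, n=100
)
print(z, p_value, reject_null)

# The same data with a stricter alpha of 0.01
_, _, reject_at_1_percent = two_tailed_z_test(103.0, 100.0, 15.0, 100, alpha=0.01)
print(reject_at_1_percent)
```

With these numbers the z-score is 2 and the p-value is a little under 0.05, so the null hypothesis is rejected at alpha = 0.05 but not at alpha = 0.01.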
Errors in statistical tests
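  • type 1 error: a false positive (rejecting a null hypothesis that is actually true)
  • type 2 error: a false negative (failing to reject a null hypothesis that is actually false)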
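
A rough simulation can show how alpha relates to type 1 errors: when the null hypothesis is true, a test at significance level alpha should wrongly reject it in roughly an alpha fraction of experiments. The sketch below assumes a two-tailed z-test on standard normal data; the seed, sample size and trial count are arbitrary choices, not from the session.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

random.seed(0)  # fixed seed so the run is repeatable
ALPHA = 0.05
SAMPLE_SIZE = 30
TRIALS = 5000


def rejects_null(sample, pop_mean=0.0, pop_std=1.0, alpha=ALPHA):
    """Two-tailed z-test: True if the null hypothesis is rejected."""
    z = (fmean(sample) - pop_mean) / (pop_std / sqrt(len(sample)))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


# Every sample really does come from the null distribution (mean 0, std 1),
# so each rejection counted here is a false positive, i.e. a type 1 error.
false_positives = sum(
    rejects_null([random.gauss(0.0, 1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(TRIALS)
)
type1_rate = false_positives / TRIALS
print(type1_rate)
```

The printed rate should land close to 0.05, which is the sense in which alpha is the type 1 error rate we are willing to accept.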
Homework
  • Linking colab with github
  • Understanding test statistics (Wikipedia)
  • Z-score, two-tailed test for a numerical problem


ML Intro Series 2:
In series 2 of the ML series, conditional probability, Naive Bayes and Bayesian learning were discussed, and there was also a lab on implementing a Naive Bayes classifier using scikit-learn.

Classification
In classification, the responses are categorical in nature.

Events can be
  • dependent
  • independent
Conditional Probability
  • describes the probability of one event given that another, dependent event has occurred
  • the occurrence of one event changes the probability of the other event
Bayes Theorem Links

Bayes Theorem Formula
P(A|B)=P(B|A)*P(A)/P(B)

Prior Probability
Probability of an event before new evidence is taken into account, P(A) in the formula.

Posterior Probability
Probability of an event after the evidence has been taken into account, P(A|B) in the formula.
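
A small worked example of the formula, with made-up numbers: A is "the email is spam" and B is "the email contains the word offer". The probabilities and the function name below are assumptions for illustration only. P(B) is expanded with the law of total probability.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    # P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b


# Hypothetical numbers
prior_spam = 0.2              # P(A): prior probability that an email is spam
p_offer_given_spam = 0.5      # P(B|A)
p_offer_given_not_spam = 0.05 # P(B|not A)

posterior_spam = bayes_posterior(
    prior_spam, p_offer_given_spam, p_offer_given_not_spam
)
print(posterior_spam)
```

Here seeing the word moves the probability of spam from the prior of 0.2 to a posterior of 0.1 / 0.14, about 0.71.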

Naive Bayes Classifier is based on the principle of Bayes Theorem

  • It assumes all features are independent of each other, given the class.
  • All features contribute equally.
These assumptions can be wrong, and because of them the classifier is called naive.
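
The lab used scikit-learn, but the idea is small enough to sketch in plain Python. The example below is a minimal categorical Naive Bayes with add-one (Laplace) smoothing. The toy weather data, the function names and the smoothing choice are my own assumptions, not the lab's code.

```python
from collections import Counter, defaultdict
from math import log


def train_naive_bayes(rows, labels):
    """Count classes and per-class feature values from categorical data."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    feature_values = defaultdict(set)    # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict_label(model, row):
    """Pick the class with the highest log P(class) + sum of log P(value|class)."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = log(count / total)  # log prior
        for i, value in enumerate(row):
            # Naive step: each feature is treated independently given the class.
            # Add-one smoothing avoids zero probability for unseen values.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score += log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: (outlook, temperature) -> play outside?
rows = [
    ("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
    ("rainy", "cool"), ("overcast", "hot"), ("overcast", "cool"),
]
labels = ["no", "no", "yes", "yes", "yes", "yes"]

model = train_naive_bayes(rows, labels)
print(predict_label(model, ("sunny", "hot")))
print(predict_label(model, ("rainy", "cool")))
```

Multiplying the per-feature probabilities (here, adding their logs) is exactly where the independence assumption enters. If the features are strongly correlated, the scores will be overconfident.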

