Advanced Data Science and Analytics with Python
Jesús Rogel-Salazar
Advanced Data Science and Analytics with Python enables data scientists to continue developing their skills and apply them in business as well as academic settings. The subjects discussed in this book are complementary to, and a follow-up of, the topics discussed in Data Science and Analytics with Python. The aim is to cover important advanced areas in data science using tools developed in Python, such as Scikit-learn, Pandas, NumPy, Beautiful Soup, NLTK, NetworkX and others. Model development is supported by frameworks such as Keras, TensorFlow and Core ML, as well as by Swift for the development of iOS and macOS applications.
Features:
- Targets readers with a background in programming who are interested in the tools used in data analytics and data science
- Uses Python throughout
- Presents tools, alongside solved examples, with steps that the reader can easily reproduce and adapt to their needs
- Focuses on the practical use of the tools rather than on lengthy explanations
- Provides the reader with the opportunity to use the book whenever needed rather than following a sequential path
The book can be read independently from the previous volume and each of the chapters in this volume is sufficiently independent from the others, providing flexibility for the reader. Each of the topics addressed in the book tackles the data science workflow from a practical perspective, concentrating on the process and results obtained. The implementation and deployment of trained models are central to the book.
Time series analysis, natural language processing, topic modelling, social network analysis, neural networks and deep learning are comprehensively covered. The book discusses the need to develop data products and addresses the subject of bringing models to their intended audiences: in this case, literally to the users' fingertips in the form of an iPhone app.
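As a flavour of the time series manipulation covered in the book, here is a minimal Pandas sketch showing resampling, a moving average and exponential smoothing. The data is synthetic and purely illustrative; it is not one of the book's worked examples.

```python
import numpy as np
import pandas as pd

# Synthetic daily "closing prices" for one year (illustrative only)
days = pd.date_range("2017-04-01", periods=365, freq="D")
prices = pd.Series(
    100 + np.random.default_rng(42).standard_normal(365).cumsum(),
    index=days,
)

monthly = prices.resample("MS").mean()     # average price per calendar month
moving = prices.rolling(window=30).mean()  # 30-day moving average
smoothed = prices.ewm(span=30).mean()      # exponential smoothing
```

The same resample/rolling/ewm pattern extends directly to the stock-market-style data the book works with.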
About the Author
Dr. Jesús Rogel-Salazar is a lead data scientist in the field, working for companies such as Tympa Health Technologies, Barclays, AKQA, IBM Data Science Studio and Dow Jones. He is a visiting researcher at the Department of Physics at Imperial College London, UK, and a member of the School of Physics, Astronomy and Mathematics at the University of Hertfordshire, UK.
Year:
2020
Publisher:
CRC Press
Language:
English
Pages:
424
ISBN 10:
0429446616
ISBN 13:
9780429446610
Series:
Chapman & Hall/CRC Data Mining and Knowledge Discovery Series
File:
PDF, 22.00 MB
Advanced Data Science and Analytics with Python

Chapman & Hall/CRC Data Mining and Knowledge Discovery Series
Series Editor: Vipin Kumar

- Text Mining and Visualization: Case Studies Using Open-Source Tools, Markus Hofmann and Andrew Chisholm
- Graph-Based Social Media Analysis, Ioannis Pitas
- Data Mining: A Tutorial-Based Primer, Second Edition, Richard J. Roiger
- Data Mining with R: Learning with Case Studies, Second Edition, Luís Torgo
- Social Networks with Rich Edge Semantics, Quan Zheng and David Skillicorn
- Large-Scale Machine Learning in the Earth Sciences, Ashok N. Srivastava, Ramakrishna Nemani, and Karsten Steinhaeuser
- Data Science and Analytics with Python, Jesús Rogel-Salazar
- Feature Engineering for Machine Learning and Data Analytics, Guozhu Dong and Huan Liu
- Exploratory Data Analysis Using R, Ronald K. Pearson
- Human Capital Systems, Analytics, and Data Mining, Robert C. Hughes
- Industrial Applications of Machine Learning, Pedro Larrañaga et al.
- Automated Data Analysis Using Excel, Second Edition, Brian D. Bissett
- Advanced Data Science and Analytics with Python, Jesús Rogel-Salazar

For more information about this series please visit: https://www.crcpress.com/ChapmanHallCRCDataMiningandKnowledgeDiscoverySeries/bookseries/CHDAMINODIS

Advanced Data Science and Analytics with Python
Jesús Rogel-Salazar

First edition published 2020 by CRC Press, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742, and by CRC Press, 2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN.

© 2020 Taylor & Francis Group, LLC. CRC Press is an imprint of Taylor & Francis Group, LLC.

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained.
If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

ISBN: 9780429446610 (hbk)
ISBN: 9781138315068 (pbk)
ISBN: 9780429446641 (ebk)

To A. J. Johnson. Then. Now. Always.

Contents

1 No Time to Lose: Time Series Analysis
  1.1 Time Series
  1.2 One at a Time: Some Examples
  1.3 Bearing with Time: Pandas Series
    1.3.1 Pandas Time Series in Action
    1.3.2 Time Series Data Manipulation
  1.4 Modelling Time Series Data
    1.4.1 Regression... (Not) a Good Idea?
    1.4.2 Moving Averages and Exponential Smoothing
    1.4.3 Stationarity and Seasonality
    1.4.4 Determining Stationarity
    1.4.5 Autoregression to the Rescue
  1.5 Autoregressive Models
  1.6 Summary
2 Speaking Naturally: Text and Natural Language Processing
  2.1 Pages and Pages: Accessing Data from the Web
    2.1.1 Beautiful Soup in Action
  2.2 Make Mine a Regular: Regular Expressions
    2.2.1 Regular Expression Patterns
  2.3 Processing Text with Unicode
  2.4 Tokenising Text
  2.5 Word Tagging
  2.6 What Are You Talking About?: Topic Modelling
    2.6.1 Latent Dirichlet Allocation
    2.6.2 LDA in Action
  2.7 Summary
3 Getting Social: Graph Theory and Social Network Analysis
  3.1 Socialising Among Friends and Foes
  3.2 Let's Make a Connection: Graphs and Networks
    3.2.1 Taking the Measure: Degree, Centrality and More
    3.2.2 Connecting the Dots: Network Properties
  3.3 Social Networks with Python: NetworkX
    3.3.1 NetworkX: A Quick Intro
  3.4 Social Network Analysis in Action
    3.4.1 Karate Kids: Conflict and Fission in a Network
    3.4.2 In a Galaxy Far, Far Away: Central Characters in a Network
  3.5 Summary
4 Thinking Deeply: Neural Networks and Deep Learning
  4.1 A Trip Down Memory Lane
  4.2 No-Brainer: What Are Neural Networks?
    4.2.1 Neural Network Architecture: Layers and Nodes
    4.2.2 Firing Away: Neurons, Activate!
    4.2.3 Going Forwards and Backwards
  4.3 Neural Networks: From the Ground Up
    4.3.1 Going Forwards
    4.3.2 Learning the Parameters
    4.3.3 Backpropagation and Gradient Descent
    4.3.4 Neural Network: A First Implementation
  4.4 Neural Networks and Deep Learning
    4.4.1 Convolutional Neural Networks
    4.4.2 Convolutional Neural Networks in Action
    4.4.3 Recurrent Neural Networks
    4.4.4 Long Short-Term Memory
    4.4.5 Long Short-Term Memory Networks in Action
  4.5 Summary
5 Here Is One I Made Earlier: Machine Learning Deployment
  5.1 The Devil in the Detail: Data Products
  5.2 Apples and Snakes: Core ML + Python
  5.3 Machine Learning at the Core: Apps and ML
    5.3.1 Environment Creation
    5.3.2 Eeny, Meeny, Miny, Moe: Model Selection
    5.3.3 Location, Location, Location: Exploring the Data
    5.3.4 Modelling and Core ML: A Crucial Step
    5.3.5 Model Properties in Core ML
  5.4 Surprise and Delight: Build an iOS App
    5.4.1 New Project: Xcode
    5.4.2 Push My Buttons: Adding Functionality
    5.4.3 Being Picky: The Picker View
    5.4.4 Model Behaviour: Core ML + SwiftUI
  5.5 Summary
A Information Criteria
B Power Iteration
C The Softmax Function and Its Derivative
  C.1 Numerical Stability
D The Derivative of the Cross-Entropy Loss Function
Bibliography
Index

List of Figures

1.1 A time series of the log returns for Apple Inc. for a year since April 2017.
1.2 Solar activity from 1749 through 2013.
1.3 Closing prices for Apple Inc. for a year since April 2017.
1.4 Total of monthly visitors for the data entered manually.
1.5 Open, high, low and close prices for the exchange rate of bitcoin/USD.
1.6 White noise with zero mean, constant variance, and zero correlation.
1.7 Closing prices for Apple Inc. for a year since April 2017 and a trend line provided by a multivariate regression.
1.8 Moving averages (upper panel) and exponential smoothing (lower panel) applied to the closing prices for Apple Inc.
1.9 Analysis of the power spectrum of the sunspots data. We can see that a maximum in activity occurs approximately every 11 years.
1.10 Sunspot activity and rolling statistics for the average and the standard deviation.
1.11 Trend, seasonality and residual components for the sunspot dataset.
1.12 Trend, seasonality and residual components for the bitcoin dataset.
1.13 Autocorrelation and partial autocorrelation for the sunspot dataset.
1.14 Autocorrelation and partial autocorrelation for the bitcoin dataset.
1.15 Prediction for the sunspot activity using an ARMA(9, 0) model.
2.1 A very simple webpage.
2.2 A preview of the Iris HTML webpage.
2.3 A schematic representation of HTML as a tree. We are only showing a few of the branches.
2.4 A chunked sentence with two named entities.
2.5 Top 10 named entities in the 2009 speech made by Barack Obama before a Joint Session of the Congress.
2.6 Graphical model representation of LDA.
3.1 An example of a social network with directed edges.
3.2 The ego network for Terry G. Only the related nodes are highlighted and the rest are dimmed down for clarity.
3.3 Transitivity in a network.
3.4 A schematic geographical representation of the seven bridges of Königsberg and a network highlighting the connectivity of the four land masses in question.
3.5 An example graph with seven nodes, and two subgraphs.
3.6 A simple graph depicting eight nodes and five edges.
3.7 Zachary's karate club: 34 individuals at the verge of a club split. Edges correspond to friendship relationships among club members.
3.8 Degree measure of the Zachary karate club network. The size of the nodes denotes the degree and the colour corresponds to the groups formed after the split of the club. The darker grey nodes are Mr. Hi's group and the light grey ones are John A.'s supporters.
3.9 Frequencies of the degree centrality measure for the karate club network.
3.10 Degree centrality measure of Zachary's karate club. The size of the nodes denotes the degree centrality. We can see the importance of not only nodes 1, 34, 33, but also 2 and 3.
3.11 Betweenness of Zachary's karate club network. The size of the nodes denotes the betweenness. We can see the importance of nodes 1, 34, as well as 33 and 3. Node 32 is a bridge in the network.
3.12 Closeness of Zachary's karate club network. The size of the nodes denotes the closeness. We can see the importance of the nodes we already know about: 1, 34, 33 and 3. Node 9 is a close node in the network too.
3.13 Eigenvector centrality of Zachary's karate club network. The size of the nodes denotes the eigenvector centrality of the network.
3.14 PageRank of Zachary's karate club network. The size of the nodes denotes the PageRank scores of the network.
3.15 Reduced network for Zachary's karate club. We have removed nodes 2, 3, 9 and 32 that are important for the cohesion of the network. The size of the nodes denotes the degree centrality of the nodes.
3.16 k-components of Zachary's karate club network.
3.17 Some of the cliques in Zachary's karate club network.
3.18 Hierarchical clustering over Zachary's karate club network.
3.19 Communities discovered by the Girvan-Newman algorithm on Zachary's karate club network. Notice that nodes 3 and 9 have been assigned to John A.'s faction.
3.20 Communities discovered by the Louvain algorithm on Zachary's karate club network. We have four communities denoted by different shades of grey.
3.21 Star Wars network covering Episodes I-VII. Layout inspired by the famous Death Star.
3.22 Distribution of the degree centrality for the Star Wars network.
3.23 Degree measure of the Star Wars network. The size of the nodes denotes the degree centrality of the node.
3.24 Eigenvector centrality for the Star Wars network. The size of the nodes denotes the eigenvector centrality of the node.
3.25 PageRank for the nodes in the Star Wars network. The size of the nodes denotes the PageRank score for the node.
3.26 Vader networks for the following centrality measures: degree centrality, eigenvector centrality, PageRank and betweenness.
3.27 Star Wars sides (communities) obtained with the application of the Girvan-Newman algorithm.
4.1 Neural network architecture with a single hidden layer.
4.2 An artificial neural network takes up an input and combines the contributions of the nodes to calculate an output ŷ with the aid of a nonlinear function of the sum of its inputs.
4.3 Neural network architecture with a single hidden layer, including bias. The inputs to a node (marked in gray) are used in conjunction with the weights w_i to calculate the output with the help of the activation function f(·).
4.4 Zooming into one of the hidden nodes in our neural network architecture.
4.5 Some common activation functions, including sigmoid, tanh and ReLU.
4.6 A plot of the softmax function.
4.7 Backward propagation of errors, or backpropagation, enables the neural network to learn from its mistakes.
4.8 General architecture of a neural network; we are showing the labels of the different L layers in the network.
4.9 The derivative of a function f indicates the rate of change at a given point. This information lets us change our parameters accordingly.
4.10 Observations corresponding to two classes, 0 and 1, described by features x1 and x2. We will use this data to train a neural network.
4.11 Classification boundary obtained with a 3-node hidden layer neural network. The discrimination is modelled well with a cubic-like function.
4.12 Classification boundaries for a neural network with one hidden layer comprising 1, 2, 3, 10, 30 and 50 hidden nodes.
4.13 Classification boundary obtained with a sequential model for a neural network implemented in Keras.
4.14 An image of a letter J (on the left). After applying an identity kernel the result is a scaled-down version of the image (on the right).
4.15 An image of a Jackalope icon (on the left). After applying a sharpening filter, we obtain the image on the right.
4.16 Architecture of a convolutional neural network.
4.17 Example images for each of the ten classes in the CIFAR-10 dataset. The pixelation is the result of the images being 32 × 32.
4.18 A picture of a nice feline friend to test our convolutional neural network.
4.19 A diagrammatic representation of the architecture of a recurrent neural network.
4.20 The inner workings of a long short-term memory neural network.
5.1 We follow this workflow to deploy our machine learning models to our app.
5.2 A line of best fit for the observations y dependent on features x1.
5.3 Boston house prices versus average number of rooms (top) and per capita crime rate (bottom).
5.4 Visualisation of the Boston house price model converted into Core ML format.
5.5 Properties of the Boston Pricer Core ML model created from Scikit-learn.
5.6 Creating a new Xcode project for a Single View App.
5.7 We need to provide some metadata for the project we are creating.
5.8 The LaunchScreen.storyboard element is the main interface presented to our users.
5.9 Open the Library with the plus icon, and the Object Library with the icon that looks like a square inside a circle.
342 5.19 The attributes can be changed in the preview. 343 5.20 The app layout is automatically handled with SwiftUI. 344 5.21 The app state after pressing the button. 347 5.22 Adding a couple of pickers to our app. 348 5.23 The pickers are now showing the correct values we specified. 349 5.24 We can see that the app is capturing the correct state for the pickers. 350 5.25 Adding a New Group to our project. 5.26 Adding resources to our Xcode project. 351 351 advanced data science and analytics with python 5.27 The final app producing predictions for our users out of a linear regression model first developed with Python. 354 xxi List of Tables 1.1 Offset aliases used by Pandas to represent common time series frequencies. 15 1.2 Descriptive statistics for the data entered manually. We are not including the count in this table. 16 1.3 Some format directives for the strftime method. 17 1.4 Parameters specifying the decay applied to an exponential smoothing calculation with ewm. 2.1 Common HTML tags. 38 61 2.2 Regular expression patterns. We use ellipses (...) to denote sequences of characters. 80 3.1 Character rankings for the most central characters in the Star Wars saga given by various centrality measures. 202 4.1 Capabilities of neural networks with a different number of hidden layers. 217 5.1 Models and frameworks supported by Core ML. 312 Preface Writing a book is an exhilarating experience, if at times a bit hard and maddening. This companion to Data Science and Analytics with Python1 is the result of arguments with myself about writing something to cover a few of the areas that were not included in that first volume, largely due to RogelSalazar, J. (2017). Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press 1 space/time constraints. 
Like the previous book, this one exists thanks to the discussions, standups, brainstorms and eventual implementations of algorithms and data science projects carried out with many colleagues and friends. The satisfaction of seeing happy users/customers with products they can rely on is, and will continue to be, a motivation for me.

The subjects discussed in this book are complementary and a follow-up to the ones covered in Volume 1. The intended audience for this book is still composed of data analysts and early-career data scientists with some experience in programming and with a background in statistical modelling. In this case, however, the expectation is that they have already covered some areas of machine learning and data analytics. Although I will refer to the previous book in parts where some knowledge is assumed, the book is written to be read independently from Volume 1. The book and its companion are a good reference for seasoned practitioners too.

As the title suggests, this book continues to use Python (Python Software Foundation (1995). Python reference manual. http://www.python.org) as a tool to train, test and implement machine learning models and algorithms. Nonetheless, Python does not live in isolation, and in the last chapter of this book we touch upon the usage of Swift (Apple Inc. (2014). Swift programming language. https://swift.org) as a programming language to help us deploy our machine learning models.

Python continues to be, in my view, a very useful tool. The number of modules, packages and contributions that Pythonistas have made to the rest of the community make it a worthwhile programming language to learn. It is no surprise that the number of Python users continues to grow. Similarly, the ecosystem of the language is also evolving: from the efforts to bring Python 3.x to be the version of choice, through to the development of the computational environment that is the Jupyter Notebook and its evolution, the JupyterLab (visit https://jupyterlab.readthedocs.io for further information). For those reasons, we will continue using some excellent libraries, such as Scikit-learn (Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830), Pandas (McKinney, W. (2012). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media), NumPy (Scientific Computing Tools for Python (2013). NumPy. http://www.numpy.org) and others. After all, we have seen Nobel prize winning research being supported by Python, as have been a number of commercial enterprises, including consultancies, startups and established companies. The decision to use Python for this second volume is therefore not just one of convenience and continuity, but a conscious adoption that I hope will support you too.

As I mentioned above, the book covers aspects that were necessarily left out in the previous volume; however, the readers in mind are still technical people interested in moving into the data science and analytics world. I have tried to keep the same tone as in the first book, peppering the pages with some bits and bobs of popular culture, science fiction and indeed Monty Python puns. I sincerely hope the most obscure ones do make you revisit their excellent work. The aim is still to focus on showing the concepts and ideas behind popular algorithms and their use. As before, we are not delving, in general, into exhaustive implementations from scratch, and instead relying on existing modules.

The examples contained here have been tested in Python 3.7 under macOS, Linux and Windows 10. We do recommend that you move on from Python 2; maintenance for Python 2 has stopped as of January 2020. For reference, the versions of some of the packages used in the book are as follows:

- Python 3.5.2
- Pandas 0.25
- NumPy 1.17.2
- Scikit-learn 0.21
- SciPy 1.3.1
- StatsModels 0.10
- BeautifulSoup 4.8.1
- NLTK 3.4.5
- NetworkX 2.4
- Keras 2.2.4
- TensorFlow 1.14.0

As before, I am using the Anaconda Python distribution (Continuum Analytics (2014). Anaconda 2.1.0. https://store.continuum.io/cshop/anaconda/) provided by Continuum Analytics. Remember that there are other ways of obtaining Python as well as other versions of the software: for instance, directly from the Python Software Foundation (https://www.python.org), distributions such as Enthought Canopy (https://www.enthought.com/products/epd/), or package managers such as Homebrew (http://brew.sh). In Chapters 4 and 5, we create conda environments to install and maintain software relevant to the discussions for those chapters, and you are more than welcome to use other virtual environment maintainers too.

We show computer code by enclosing it in a box as follows:

> 1 + 1 # Example of computer code
2

We use a diple (>) to denote the command line terminal prompt shown in the Python shell. Keeping to the look and feel of the previous book, we use margin notes to highlight certain areas or commands, as well as to provide some useful comments and remarks. As mentioned before, the book can be read independently from the previous volume, and indeed each chapter is as self-contained as possible.

I would also like to remind you that writing code is not very dissimilar to writing poetry (I hope Sor Juana would forgive my comparison). If I asked that each of us write a poem about the beauty of a Jackalope, we would all come up with something. Some would write odes to Jackalopes that would be remembered by generations to come; some of us would complete the task with a couple of rhymes.
In that way, the code presented here may not be award-winning poetry, but the aim, I hope, will be met. I would welcome hearing about your poems. Do get in touch!

We start in Chapter 1 with a discussion about time series data and its analysis. We see how Pandas has us covered to deal with the fiendish matter of date data types. We learn how to use time series data similar to that found in stock markets and see how Pandas lets us carry out resampling, slicing and dicing, filtering, aggregating and plotting this kind of data. In terms of modelling, in this chapter we see how moving averages and exponential smoothing let us get a first approach at forecasting future values of the series based on previous observations. We look at autoregression and see how it can be used to model time series.

In Chapter 2, we take a look at processing text data containing natural language. We look at how we can obtain data from the web and scrape data that otherwise would be out of reach to us. We take a look at the use of regular expressions to capture specific patterns in a piece of text and learn how to deal with Unicode. Looking at text data in this way leads us to the analysis of language, culminating with topic modelling as an unsupervised learning task to identify the possible subjects or topics that are addressed in a set of documents.

In Chapter 3, we look into some fundamental concepts used in the analysis of networks, whether social or otherwise; a topic that will inevitably make us more social. We look at graph theory as a way to discover relationships encoded in networks such as small-world ones. We have a chance to talk about measures such as degree centrality, closeness, betweenness, and others. We even do this with characters from a galaxy far, far away.
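The centrality measures mentioned for Chapter 3 can be previewed in a few lines with NetworkX, which ships with the Zachary karate club network used in that chapter. A minimal sketch (note that NetworkX labels the members 0 to 33, while the book's figures use 1 to 34):

```python
import networkx as nx

# Zachary's karate club: 34 members, edges are friendship relationships
G = nx.karate_club_graph()

degree = nx.degree_centrality(G)            # normalised number of connections
betweenness = nx.betweenness_centrality(G)  # brokerage between other members
pagerank = nx.pagerank(G)                   # importance via random walks

# John A. (node 33 here, node 34 in the book) has the most connections
most_connected = max(degree, key=degree.get)
print(most_connected)  # → 33
```

Each measure returns a dictionary keyed by node, which makes it easy to rank members or size the nodes in a plot, as the book's figures do.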
Chapter 4 is probably the deepest chapter of all, pun definitely intended. It is here where we turn our attention to the "unreasonable effectiveness" of neural networks and deep learning. We look at the general architecture of a neural network and build our own from scratch. Starting with feedforward networks, we move on to understand the famous backpropagation algorithm. We get a chance to look at the effect of the number of layers as well as the number of nodes in each of them. We then move on to the implementation of more complex, deeper architectures, such as convolutional and recurrent neural networks.

Finally, in Chapter 5, we look at the perennial issue of bringing our models, predictions and solutions to our customers, users and stakeholders; that is, the deployment of machine learning models. Data products are the focus of our discussion, and we see how the availability, processing, meaning and understanding of data should be at the heart of our efforts. We then look at the possibility of bringing our models to the hands of our users via the implementation of a model inside a mobile application on an Apple device such as an iPhone via Core ML.

Remember that there is no such thing as a perfect model, only good enough ones. The techniques presented in this book, and the companion volume, are not the end of the story; they are the beginning. The data that you have to deal with will guide your story. Do not let the anthropomorphic language of machine learning fool you. Models that learn, see, understand and recognise are as good as the data used to build them, and as blind as the human making decisions based on them. Use your Jackalope data science skills to inform your work.

As I said before, this book is the product of many interactions over many moons. I am indebted to many people (you know who you are!) that have directly and indirectly influenced the
words you have before you. Any errors, omissions or simplifications are mine. As always, I am grateful to my family and friends for putting up with me when I excuse myself with the old phrase:“I have to do some book... I am Do some work on the book of behind”. Thank you for putting up with another small project course... from this crazy physicist! London, UK Dr Jesús RogelSalazar March 2020 Reader’s Guide This book is intended to be a companion to any Jackalope data scientist that is interested in continuing the journey following the subjects covered in Data Science and Analytics with Python8 . The material covered here is fairly independent from the book mentioned above though. The chapters in this book can be read on their own and in any order you desire. If you require some direction though, here is a guide that may help in reading and/or consulting the book: • Managers and readers curious about Data Science: – Take a look at the discussion about data products in Chapter 5. This will give you some perspective of the areas that your Jackalope data scientists need to consider in their daytoday work. – I recommend you also take a look at Chapters 1 and 3 of the companion book mentioned above. – Make sure you understand those chapters insideout; they will help you understand your rangale of Jackalope data scientists. 8 RogelSalazar, J. (2017). Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press xxxiv j. rogelsalazar • Beginners: – Start with Chapters 2 and 3 of the companion book. They will give you a solid background to tackle the rest of this book. – Chapter 1 of this book provides a good way to continue learning about the capabilities of Pandas. – Chapter 2 of this book on natural language processing will give you a balanced combination of powerful tools, with an easy entry level. 
• Seasoned readers, and those who have covered the first volume of this series, may find it easier to navigate the book by themes or subjects:
  – Time Series Data is covered in Chapter 1, including:
    * Handling of date data
    * Time series modelling
    * Moving averages
    * Seasonality
    * Autoregression
  – Natural Language Processing is covered in Chapter 2, including:
    * Text data analysis
    * Web and HTML scraping
    * Regular expressions
    * Unicode encoding
    * Text tokenisation and word tagging
    * Topic modelling
  – Network Analysis is discussed in Chapter 3, including:
    * Graph theory
    * Centrality measures
    * Community detection and clustering
    * Network representation
  – Neural Networks and Deep Learning are addressed in Chapter 4, where we look at:
    * Neural network architecture
    * Perceptron
    * Activation functions
    * Feedforward networks
    * Backpropagation
    * Deep learning
    * Convolutional neural networks
    * Recurrent neural networks
    * LSTM
  – Model Deployment and iOS App Creation is covered in Chapter 5, including:
    * Data products
    * Agile methodology
    * App design
    * Swift programming language
    * App deployment

About the Author

Dr Jesús Rogel-Salazar is a lead data scientist with experience in the field, working for companies such as AKQA, IBM Data Science Studio, Dow Jones, Barclays and Tympa Health Technologies. He is a visiting researcher at the Department of Physics at Imperial College London, UK, and a member of the School of Physics, Astronomy and Mathematics at the University of Hertfordshire, UK. He obtained his doctorate in Physics at Imperial College London for work on quantum atom optics and ultracold matter. He has held a position as senior lecturer in mathematics, as well as working as a consultant and data scientist, for a number of years in a variety of industries, including science, finance, marketing, people analytics and health, among others.
He is the author of Data Science and Analytics with Python and Essential MATLAB® and Octave, both also published by CRC Press. His interests include mathematical modelling, data science and optimisation in a wide range of applications, including optics, quantum mechanics, data journalism, finance and health.

Other Books by the Same Author

• Data Science and Analytics with Python. CRC Press, 2018. ISBN 9781138043176 (hardback), 9781498742092 (paperback).
  Data Science and Analytics with Python is designed for practitioners in data science and data analytics in both academic and business environments. The aim is to present the reader with the main concepts used in data science using tools developed in Python. The book discusses what data science and analytics are, from the point of view of the process and results obtained.

• Essential MATLAB® and Octave. CRC Press, 2014. ISBN 9781138413115 (hardback), 9781482234633 (paperback).
  Widely used by scientists and engineers, well-established MATLAB® and open-source Octave provide excellent capabilities for data analysis, visualisation, and more. By means of straightforward explanations and examples from different areas in mathematics, engineering, finance, and physics, the book explains how MATLAB and Octave are powerful tools applicable to a variety of problems.

1 No Time to Lose: Time Series Analysis

Have you ever wondered what the weather, financial prices, home energy usage, and your weight all have in common? (Not obvious? Oh... well, read on!) Well, apart from the obvious, the data to analyse these phenomena can be collected at regular intervals over time. Common sense, right? Well, there is no time to lose (or is it Toulouse, as in France?); let us take a deeper look into this exciting kind of data. Are you ready?

A time series is defined as a sequence of data readings in successive order, and can be taken on any variable that changes over time.
So, if a time series is a set of data collected over time, then a lot of things, not just our weight or the weather, would be classed as time series. There are, obviously and quite literally, millions of data points that can be collected over time. However, a lot of data being collected over time does not by itself make that data a time series, and time series analysis is not necessarily immediately employed. Time series analysis encapsulates the methods used to understand a sequence of data points such as those mentioned above as a time series, and to extract useful information from it. A main goal is that of forecasting successive future values of the series. In this chapter we will cover some of these methods. Let us take a look.

1.1 Time Series

Knowing how to model time series is surely an important tool in our Jackalope data scientist toolbox. Jackalopes? Yes! Long story... You can get further information in Chapter 1 of Data Science and Analytics with Python [1]. But I digress; the key point about time series data is that the ordering of the data points in time matters.

For many datasets it is not important in which order the data are obtained or listed. One order is as good as another, and although the ordering may tell us something about the dataset, it is not an inherent attribute of the set. See for instance the datasets analysed in the book mentioned above. However, for time series data the ordering is absolutely crucial. The order imposes a certain structure on the data, which in turn is of relevance to the underlying phenomenon studied. So, what is different about time series? Well, time! Furthermore, we will see later on in this chapter that in some cases there are situations where future observations are influenced by past data points.

[1] Rogel-Salazar, J. (2017). Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press.
All in all, this is not a surprising statement; we are well acquainted with causality relationships. Let us have a look at an example of a time series. In Figure 1.1 we can see a financial time series corresponding to the log returns of Apple for a year starting in April 2017.

[Figure 1.1: A time series of the log returns for Apple Inc. for a year since April 2017.]

The log returns are used to determine the proportional amount you might get on a given day compared to the previous one. With that description in mind, we can see how we are relating the value on day n to the one on day n − 1. The log return is given by log(FV/PV), where FV is the future value and PV is the past value. In that way, a Jackalope data scientist working in finance may be able to look at the sequence provided by the time series to determine a model that can predict what the next value will be (and hop all the way to the bank...). The same train of thought will be applicable to a variety of other human endeavours, from agriculture to climate change, and from geology to solar dynamics.

In contrast, in many other cases the implicit assumption we may be able to make is that the observations we take are not a sequence, and that the values obtained are independent from each other. Let us consider the Iris dataset that we have used in Chapter 3 of Data Science and Analytics with Python [2]. The dataset records measurements of three species of iris flowers in centimetres, including sepal length, sepal width, petal length and petal width. In collecting the information, there is no reason to believe that the fact that the current iris specimen we measure has a petal length of, say, 6.1 cm tells us anything about the next specimen.

In a time series the opposite is true, i.e., whatever happens at time t has information about what will happen at t + 1. In that sense, our observations of the phenomenon at hand are at the same time both outcomes and predictors: Outcomes of the previous time step, and predictors of the next one. I know what you are thinking: cool!! And now how do we deal with that situation!??! (Isn't it cool to be able to use interrobangs!??!) You will be happy (although not surprised perhaps) that there is an answer: There are various ways to deal with this input/output duality, and the appropriate methodology very much depends on what I call the personality of the data, i.e., the nature of the data itself, how it was obtained and what answers we require from it. I think data, like humans, has some personality too. In this chapter we shall see some of the ways we can analyse time series data. Let us start with a few examples.

1.2 One at a Time: Some Examples

In the previous section we have seen a first example of a time series given by the log returns of Apple (shown in Figure 1.1). We can clearly see a first maximum on August 2nd, 2017. This corresponds to the day Apple released their third-quarter results for 2017, beating earnings and revenue estimates [3]. There are several other peaks and troughs during the year of data plotted. These are not uncommon in many financial time series, and not all may have a straightforward explanation like the one above.

Another interesting thing we can notice is that if we were to take the average of the values in the series, we can see that it is a fairly stable measure: an average return of approximately zero! Nonetheless, the variability of the data points changes as we move forwards in time. We shall see later on some models that will exploit these observations to analyse this type of data.

Let us see another example from a very different area: Solar dynamics. In Figure 1.2 we can see the number of sunspots per month from 1749 through 2013. The earliest study of the periodicity of sunspots was the work by Schuster [4] in 1906. Schuster is credited with coining the concept of antimatter, and as cool as that is, in this case we would like to concentrate on the periodogram analysis he pioneered to establish an approximate 11-year cycle in the solar activity. Sunspots indicate intensive magnetic activity in the Sun, and we can see in the figure the regular appearance of maximum and minimum activity. Understanding the behaviour of sunspots is important due to their link with solar activity: It enables us to carry out space weather predictions, which matter for satellite communication, and also provides us with awe-inspiring and spectacular auroras.

[Figure 1.2: Solar activity from 1749 through 2013.]

If our goal is indeed to generate predictions from the data in a time series, there are certain assumptions that can help us in our quest. A typical assumption made is that there is some structure in the time series data. This structure may be somewhat obfuscated by random noise.

[2] Rogel-Salazar, J. (2017). Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press.
[3] Archer, S. (2017). Apple hits a record high after crushing earnings (AAPL). http://markets.businessinsider.com/news/stocks/applestockpricerecordhighaftercrushingearnings20178100222647. Accessed: 2018-05-01.
[4] Schuster, A. (1906). II. On the periodicities of sunspots. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 206(402-412), 69–100.
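A minimal sketch of the periodogram idea mentioned above: a periodic structure buried in noise shows up as a peak in the power spectrum. Here we use a synthetic signal with an 11-unit cycle (not the real sunspot data) and NumPy's FFT, rather than Schuster's original calculation.

```python
import numpy as np

# Synthetic "sunspot-like" signal: an 11-unit cycle plus random noise
# (illustration only; not the actual sunspot dataset)
rng = np.random.default_rng(42)
t = np.arange(264)  # e.g. 264 evenly spaced readings
signal = np.sin(2 * np.pi * t / 11) + 0.3 * rng.standard_normal(len(t))

# Periodogram: squared magnitude of the Fourier coefficients
freqs = np.fft.rfftfreq(len(t), d=1.0)
power = np.abs(np.fft.rfft(signal)) ** 2

# The dominant frequency corresponds to the hidden 11-unit period
dominant = freqs[np.argmax(power[1:]) + 1]  # skip the zero-frequency term
print(1 / dominant)  # approximately 11
```

Even with the noise added, the peak in the power spectrum recovers the underlying period, which is exactly the kind of structure Schuster was after in the sunspot record.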
One way to understand the structure of a time series is to think of the trend shown in the series together with any seasonal variation: Structure = Trend + Seasonality. The trend in the Apple log returns discussed earlier on may not be very obvious. Let us take a look at the closing price of the Apple stock during the same period. In Figure 1.3 we can see the behaviour of the closing price for a year starting in April 2017. The plot shows that there is a tendency for the prices to increase over time: A trend, it should be said! Similarly, there seems to be some periodicity in the data.

[Figure 1.3: Closing prices for Apple Inc. for a year since April 2017.]

This brings us to the seasonality in a time series. Seasonality is understood in this case to be the presence of variations observed at regular intervals in our data set. These intervals may be daily, weekly, monthly, etc. Seasonal variation may be an important source of information in our quest for predictability, as it captures information that will clearly have an impact on the events you are measuring with your data. The seasonality in the sunspot activity shown in Figure 1.2 is undeniable.

1.3 Bearing with Time: Pandas Series

Now that we have a better idea of what makes a time series dataset different from other types of data, let us consider how we can deal with and manipulate them in a way that makes life easier for us Jackalope data scientists. I am sure that you have come across the great and useful Python module called Pandas. Its original author, Wes McKinney, started developing the module to deal with panel data, encountered in statistics and econometrics [5]. Indeed, he started using Python to perform quantitative analysis on financial data at AQR Capital Management.
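The Structure = Trend + Seasonality decomposition above can be sketched with a centred moving average: averaging over one full seasonal cycle smooths the seasonality away and leaves an estimate of the trend. The series below is synthetic (a linear trend plus a 12-step seasonal swing, not the Apple prices), and the 12-step window is an assumption tied to that construction.

```python
import numpy as np
import pandas as pd

# Synthetic series: a linear trend plus a 12-step seasonal swing
# (illustration only, not the Apple closing prices)
idx = pd.date_range('2017-04-01', periods=120, freq='D')
values = np.arange(120) * 0.5 + 10 * np.sin(2 * np.pi * np.arange(120) / 12)
s = pd.Series(values, index=idx)

# A centred moving average over one full seasonal cycle averages the
# seasonal swing to zero, leaving an estimate of the trend
trend = s.rolling(window=12, center=True).mean()

# What remains after removing the trend is the seasonal component
# (plus noise, in a real dataset)
seasonal = s - trend
```

In the interior of the series the estimated trend increases by 0.5 per step, matching the linear component we put in; the edges are NaN because the centred window does not fit there.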
Today, Pandas is a well-established open-source piece of software with multiple uses and a large number of contributors. Since time is an important part of a time series, let us take a look at some data that contains time as one of its columns (a hint is in the name...). We can start by loading some useful modules, including Pandas and datetime:

    import numpy as np
    import pandas as pd
    from datetime import datetime

We can create a dictionary with some sample data. We are creating a dataframe with two columns, date and visitors, where each column is given as a list:

    data = {'date': ['2018-01-01', '2018-02-01', '2018-03-01',
                     '2018-04-01', '2018-05-01', '2018-06-01',
                     '2018-01-01', '2018-02-01', '2018-03-01',
                     '2018-04-01', '2018-05-01', '2018-06-01'],
            'visitors': [35, 30, 82, 26, 83, 46,
                         40, 57, 95, 57, 87, 42]}

We have monthly visitor data for January through June 2018. The date is given in the format 'YYYY-MM-DD', where the year comes first, followed by the month and the day. This dictionary can be readily converted into a Pandas dataframe as follows:

    df = pd.DataFrame(data, columns=['date', 'visitors'])

Let us take a look at the data:

    > df.head()
             date  visitors
    0  2018-01-01        35
    1  2018-02-01        30
    2  2018-03-01        82
    3  2018-04-01        26
    4  2018-05-01        83

As expected, we have a dataframe with two columns. Notice that when looking at the dataset, the rows have been given a number (starting with 0, a very Pythonic way of counting). This is an index for the dataframe. Let us take a look at the types of the columns in this dataframe:

    > df.dtypes
    date        object
    visitors     int64
    dtype: object

The visitors column is of integer type, but the date column is shown to be an object. We know that this is a date, and it would be preferable to use a more relevant type.

[5] McKinney, W. (2011). pandas: a foundational Python library for data analysis and statistics. Python for High Performance and Scientific Computing. O'Reilly Media, Inc.
We can change the column with the to_datetime method, which converts Pandas columns into date objects:

    df['date'] = pd.to_datetime(df['date'])

Furthermore, since the date provides an order sequence for our data, we can do a couple of useful things. First, we can set the index to be given by the date column, and second, we can order the dataframe by this index:

    df.set_index('date', inplace=True)
    df.sort_index(inplace=True)

We have used the inplace property for both commands above. This property lets us make changes to the dataframe in situ; otherwise, we would need to make copies of it to apply the changes. Let us look at the head of our dataset:

    > df.head()
                visitors
    date
    2018-01-01        35
    2018-01-01        40
    2018-02-01        30
    2018-02-01        57
    2018-03-01        82

As we can see in the code above, the rows of the dataset have been ordered by the date index. We can now apply some slicing and dicing to our dataframe. For instance, we can look at the visitors for the year 2018 (in this case, this would correspond to all our data points):

    df['2018']

What about if we were interested in the visitors for May 2018? Well, that is easy; here we are filtering for the visitors in May 2018:

    > df['2018-05']
                visitors
    date
    2018-05-01        83
    2018-05-01        87

Other slicing and dicing techniques used in collection objects are possible thanks to the use of the colon notation.
For instance, we can request all the data from March 2018 onwards as follows (the colon notation used in other collection objects in Python works for Pandas time series too):

    > df[datetime(2018, 3, 1):]
                visitors
    date
    2018-03-01        82
    2018-03-01        95
    2018-04-01        26
    2018-04-01        57
    2018-05-01        83
    2018-05-01        87
    2018-06-01        46
    2018-06-01        42

The truncate method can help us keep all the data points before or after a given date. In this case, let us ask for the data up to March 2018:

    > df.truncate(after='2018-03-01')
                visitors
    date
    2018-01-01        35
    2018-01-01        40
    2018-02-01        30
    2018-02-01        57
    2018-03-01        82
    2018-03-01        95

Had we used the before parameter instead, we would have truncated all the data points before March 2018. We can use Pandas to provide us with useful statistics for our dataset. For example, we can count the number of data points per entry in the index; aggregations like this can be calculated with the help of groupby:

    > df.groupby('date').count()
                visitors
    date
    2018-01-01         2
    2018-02-01         2
    2018-03-01         2
    2018-04-01         2
    2018-05-01         2
    2018-06-01         2

As expected, we have two entries for each date. We can also look at statistics such as the mean and the sum of the entries. In this case, we are going to use the resample method for a series. In effect, this enables us to change the time frequency in our dataset. Let us use the 'M' offset alias to tell Pandas to create monthly statistics. For the mean we have:

    > df.resample('M').mean()
                visitors
    date
    2018-01-31      37.5
    2018-02-28      43.5
    2018-03-31      88.5
    2018-04-30      41.5
    2018-05-31      85.0
    2018-06-30      44.0

Similarly, for the sum we have:

    > df.resample('M').sum()
                visitors
    date
    2018-01-31        75
    2018-02-28        87
    2018-03-31       177
    2018-04-30        83
    2018-05-31       170
    2018-06-30        88
An offset alias, such as the 'M' used in the code above, is a string that represents a common time series frequency. We can see some of these aliases in Table 1.1.

Table 1.1: Offset aliases used by Pandas to represent common time series frequencies.

    Alias     Description
    B         business day frequency
    C         custom business day frequency
    D         calendar day frequency
    W         weekly frequency
    M         month-end frequency
    SM        semi-month-end frequency (15th and end of month)
    BM        business month-end frequency
    CBM       custom business month-end frequency
    MS        month-start frequency
    SMS       semi-month-start frequency (1st and 15th)
    BMS       business month-start frequency
    CBMS      custom business month-start frequency
    Q         quarter-end frequency
    BQ        business quarter-end frequency
    QS        quarter-start frequency
    BQS       business quarter-start frequency
    A, Y      year-end frequency
    BA, BY    business year-end frequency
    AS, YS    year-start frequency
    BAS, BYS  business year-start frequency
    BH        business hour frequency
    H         hourly frequency
    T, min    minutely frequency
    S         secondly frequency
    L, ms     milliseconds
    U, us     microseconds
    N         nanoseconds

We can even create a plot of the dataset. In this case, we show in Figure 1.4 the monthly sum of visitors for the dataset in question.

[Figure 1.4: Total of monthly visitors for the data entered manually.]

It is possible to obtain descriptive statistics with the use of the describe method, and we can do so per relevant group. For example, we can request the information for each date in the dataset:

    df.groupby('date').describe()

In Table 1.2 we see the descriptive statistics for the data entered manually earlier on. For brevity, we have decided not to include the count column.
Table 1.2: Descriptive statistics for the data entered manually. We are not including the count in this table.

    date        mean   std    min   25%    50%   75%    max
    2018-01-01  37.5    3.53  35.0  36.25  37.5  38.75  40.0
    2018-03-01  88.5    9.19  82.0  85.25  88.5  91.75  95.0
    2018-04-01  41.5   21.92  26.0  33.75  41.5  49.25  57.0
    2018-05-01  85.0    2.82  83.0  84.00  85.0  86.00  87.0
    2018-06-01  44.0    2.82  42.0  43.00  44.0  45.00  46.0

Given that date and time are important components of a time series, Pandas has some neat tricks to help us deal with them. For example, it is possible to use date formats such as that shown above, i.e., 'YYYY-MM-DD'. We can also provide a date in other formats; for instance, we can provide a date in plain natural language and convert it to a date type. Consider the following code:

    > date = pd.to_datetime("14th of October, 2016")
    > print(date)
    Timestamp('2016-10-14 00:00:00')

We have successfully transformed a date given in natural language into a time stamp. How cool is that!? We can also do the opposite; in other words, we can obtain a string out of the time stamp to tell us the weekday, month, day, etc. We can do this thanks to the strftime method together with a format directive. Some format directives for strftime are listed in Table 1.3.

Table 1.3: Some format directives for the strftime method.

    Directive  Meaning
    %a         abbreviated weekday name
    %A         full weekday name
    %b         abbreviated month name
    %B         full month name
    %c         preferred date and time representation
    %d         day of the month (1 to 31)
    %D         same as %m/%d/%y
    %e         day of the month (1 to 31)
    %m         month (1 to 12)
    %M         minute
    %S         second
    %u         weekday as a number (Mon=1 to 7)

Let us take a look at extracting the full weekday name (%A), the name of the month (%B) and the weekday number (%u); strftime lets us obtain a string out of the time stamp:

    > date.strftime('%A')
    'Friday'
    > date.strftime('%B')
    'October'
    > date.strftime('%u')
    '5'
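The same strftime directives also work element-wise on a whole DatetimeIndex, which is handy when labelling a full range of dates rather than a single time stamp. A small sketch, starting from the same date used above:

```python
import pandas as pd

# strftime applies element-wise on a DatetimeIndex,
# returning an Index of formatted strings
dates = pd.date_range('2016-10-14', periods=3)
weekdays = dates.strftime('%A')
print(list(weekdays))  # ['Friday', 'Saturday', 'Sunday']
```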
1.3.1 Pandas Time Series in Action

In some cases we may need to create time series data from scratch. In this section we are going to explore some of the ways in which Pandas enables us to create and manipulate time series data, on top of the commands we have discussed up until this point.

The first thing to take care of is the time range required for our data set. For example, we can ask Pandas to create a series of dates with date_range, determining the time range by specifying start and end times:

    > pd.date_range('2018-05-30', '2018-06-02')
    DatetimeIndex(['2018-05-30', '2018-05-31', '2018-06-01',
                   '2018-06-02'],
                  dtype='datetime64[ns]', freq='D')

Note that the output of the command above is an index covering the time range requested with a daily frequency, as shown in the output with freq='D' (recall the time offset aliases shown in Table 1.1). An alternative to the above command is to provide a start date, but instead of giving an end date, we request a number of "periods" to cover with the time series:

    > pd.date_range('2018-05-30', periods=4)
    DatetimeIndex(['2018-05-30', '2018-05-31', '2018-06-01',
                   '2018-06-02'],
                  dtype='datetime64[ns]', freq='D')

This hints at the fact that we can provide a number of periods to cover, as well as the frequency we require. For example, we can request four monthly periods, providing a start time, a number of periods and the frequency for those periods:

    > pd.date_range('2018-05-30', periods=4, freq='M')
    DatetimeIndex(['2018-05-31', '2018-06-30', '2018-07-31',
                   '2018-08-31'],
                  dtype='datetime64[ns]', freq='M')

As you can see, all we had to do was specify the monthly frequency with freq='M'. Let us construct a more complicated dataset: For a period of four days starting on June 4, 2018, we take readings for four features called A, B, C and D. In this case we will generate the readings with random numbers sampled from a standard normal distribution; these can be obtained with the randn method from numpy. Let us create some definitions:

    from numpy.random import randn

    idx = pd.date_range('2018-06-04 00:00:00', periods=4)
    cols = ['A', 'B', 'C', 'D']

We will now create data for four rows and four columns with the help of randn; note that randn(m, n) creates an array of m rows and n columns:

    data = randn(len(idx), len(cols))
In this case we will generate the readings with a random number sampled The random number can be obtained with the method random.randn from numpy. from a standard normal distribution. Let us create some definitions: from numpy.random import randn idx = pd.date_range(’20180604 00:00:00’, periods=4) cols = [’A’, ’B’, ’C’, ’D’] We will now create data for four rows and four columns with the help of randn: data = randn(len(idx), len(cols)) randn(m, n) creates an array of m rows and n columns. 20 j. rogelsalazar With this information, we now create our dataframe. df = pd.DataFrame(data=data, index=idx, columns=cols) Since we used random numbers df.index.name=’date’ to generate the data, the numbers > print(df) shown here will differ from those you may obtain on your computer. A B C D date 20180604 0.025491 20180605 1.378149 1.276321 0.200059 0.747168 0.175478 0.181216 0.601201 20180606 0.640565 0.061296 1.495377 0.042206 20180607 1.300981 1.653624 1.160137 1.909562 A table like the one above is useful to summarise data and it is fit for “human consumption”. However, in many In other words, it is an applications, it is much better to have a “long format” or arrangement that a human will find easy to read and understand. “melted” dataset, i.e., instead of arranging the data in a rectangular format as shown above, we would like all the data readings in a single column. In ordet to achieve this, we need to repeat the dates and we also require a new column to hold the feature to which each reading corresponds. This can easily be done with Pandas This is because we need the date in a single command. The first thing we need to do is reset to be part of the new formatted the index. df.reset_index(inplace=True) In order to melt the dataframe, we will use the melt method that takes the following parameters: A column that will become the new identifier variable with id_vars, the dataset. 
columns to unpivot, specified with value_vars (if no value_vars is provided, all columns are used); and finally, the names for the variable and value columns, given with var_name and value_name, respectively:

    > melted = pd.melt(df, id_vars='date',
                       var_name='feature',
                       value_name='reading')
    > print(melted)
             date feature   reading
    0  2018-06-04       A  0.025491
    1  2018-06-05       A  0.747168
    2  2018-06-06       A  0.640565
    3  2018-06-07       A  1.160137
    4  2018-06-04       B  1.378149
    5  2018-06-05       B  0.175478
    ...
    14 2018-06-06       D  0.042206
    15 2018-06-07       D  1.653624

The original columns have become entries in the column called "feature", and the values are in the column "reading". We can now set the index and sort the melted dataset:

    melted.set_index('date', inplace=True)
    melted.sort_index(inplace=True)

1.3.2 Time Series Data Manipulation

Let us take a look at some of the manipulations we have described above used on a more realistic dataset. Remember the time series for Apple Inc. returns discussed in Section 1.2? Well, we will delve a bit more into that data. The dataset is available at https://doi.org/10.6084/m9.figshare.6339830.v1 as a comma-separated value file with the name "APPL.CSV" [6]. As usual, we need to load some libraries:

    import numpy as np
    import pandas as pd

We then need to load the dataset with the help of Pandas; in this case, with the read_csv method (make sure that you pass on the correct path for the file!):

    appl = pd.read_csv('APPL.CSV')
    appl.Date = pd.to_datetime(appl.Date,
                               format='%Y-%m-%d')

In the first line of the code above, we have used the read_csv method in Pandas to load our dataset. We know that the column called "Date" should be treated as datetime, and hence we use to_datetime to make that conversion.

[6] Rogel-Salazar, J. (2018a, May). Apple Inc Prices Apr 2017 - Apr 2018. https://doi.org/10.6084/m9.figshare.6339830.v1
We are using to_datetime to ensure that dates are appropriately typed. Please note that we are also giving Pandas a helping hand by telling it the format in which the date is stored, in this case as year, followed by month and day.

The dataset contains open, high, low and close (i.e., OHLC) prices for the Apple Inc. stock between April 2017 and April 2018. We are going to concentrate on the "Close" column, but before we do that, we need to ensure that the dataset is indexed by the time stamps provided by the "Date" column. We can easily do that with the set_index method as follows:

    appl.set_index('Date', inplace=True)

We can take a look at the closing prices:

    > appl['Close'].head(3)
    Date
    2017-04-25    144.529999
    2017-04-26    143.679993
    2017-04-27    143.789993

Notice that although we only requested Python to give us a look at the Close column, the printout obtained automatically added the index given by the dates. The data provided is already ordered; however, in case we are dealing with data where the index is not in the correct order, we can sort by the index with sort_index:

    appl.sort_index(inplace=True)

The daily closing prices can be used to calculate the return at time t. This is effectively a percentage change, and can be expressed as:

    R_t = (P_t − P_{t−1}) / P_{t−1},    (1.1)

where P_t is the price at time t and P_{t−1} is the price at the previous time period. We can apply this calculation in a very easy step in Pandas, using pct_change() to calculate the returns:

    appl['pct_change'] = appl.Close.pct_change()

We can see the result of this calculation; the percentage change from one day to the next is easily calculated:

    > appl['pct_change'].tail(3)
    2018-04-23   -0.002896
    2018-04-24   -0.013919
    2018-04-25    0.004357
Continuous compounding of returns leads to the use of log returns and, as mentioned in Section 1.2, they are calculated as follows:

r_t = log(1 + R_t) = log(P_t / P_{t-1}) = log(P_t) - log(P_{t-1}).    (1.2)

We need to calculate the logarithm of the price at each time t and then take the difference between time periods. We can certainly do this in Python, and Pandas gives us a helping hand with the diff() method, which calculates the difference from one time period to the next:

appl['log_ret'] = np.log(appl.Close).diff()

We can check the result of this operation by looking at the last three entries in the new column we have created:

> appl['log_ret'].tail(3)

2018-04-23    0.002901
2018-04-24    0.014017
2018-04-25    0.004348

This is the data that we show in Figure 1.1, and indeed this is the way we calculated the time series shown in that figure.

It is fairly common to have financial data series like the one we have used above, where the frequency is given by end-of-day prices. However, the frequency can be different, for instance given by the minimum upward or downward price movement in the price of a security. This is known as a tick. Let us take a look at tick data for the Bitcoin/USD exchange rate. The dataset is available at https://doi.org/10.6084/m9.figshare.6452831.v1 as a comma-separated value file with the name bitcoin_usd.csv[7], and it contains tick data covering the period between March 31 and April 3, 2016. We can read the data in the usual way.

[7] Rogel-Salazar, J. (2018b, Jun). Bitcoin/USD exchange rate Mar 31 - Apr 3, 2016. https://doi.org/10.6084/m9.figshare.6452831.v1
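The equivalence in Equation (1.2), between taking differences of log prices and taking the log of one plus the simple return, is easy to confirm on a toy series:

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 102.0, 99.0, 101.5])

log_ret = np.log(prices).diff()                # log(P_t) - log(P_{t-1})
via_simple = np.log(1 + prices.pct_change())   # log(1 + R_t)

# Both routes of Equation (1.2) give the same numbers
print(np.allclose(log_ret.dropna(), via_simple.dropna()))  # True
```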
However, if we were to inspect the data, we would notice that the date is stored in a column called time_start, and that the format is such that the day is placed first, followed by the month and the year; the time is provided in hours and minutes. (Pro tip: Inspect your data before importing it, it will save you a few headaches!) We can use this information to create a rule to parse the date:

parser = lambda date: pd.datetime.\
    strptime(date, '%d/%m/%Y %H:%M')

We can now provide extra information to Pandas to read the data and parse the dates at the same time, specifying the columns to be parsed and how they shall be parsed:

fname = 'bitcoin_usd.csv'
bitcoin = pd.read_csv(fname,
    parse_dates=['time_start'],
    date_parser=parser,
    index_col='time_start')

Notice that we are specifying which columns need to be parsed as dates with parse_dates and how the parsing should be performed with date_parser. We also load the dataset indicating which column is the index. Let us concentrate now on the closing price and the volume, effectively creating a new dataframe called ticks:

ticks = bitcoin[['close', 'volume']]

The data is roughly on a minute-by-minute frequency. We can use Pandas to resample the data at desired intervals with the help of resample(). For instance, we can request for the data to be sampled every five minutes and take the first value in the interval:

> ticks.resample('5Min').first()

                      close     volume
time_start
2016-03-31 00:00:00  413.27   8.953746
2016-03-31 00:05:00  413.26   0.035157
2016-03-31 00:10:00  413.51  43.640052
...

We can also specify how the resampling will be performed and ask for the mean, for example:

> ticks.resample('5Min').mean()

                       close     volume
time_start
2016-03-31 00:00:00  413.270   2.735987
2016-03-31 00:05:00  413.264   2.211749
2016-03-31 00:10:00  414.660  37.919166
...
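The difference between the two aggregations can be seen on a small synthetic minute-level series (times and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Ten one-minute observations starting at midnight
idx = pd.date_range('2016-03-31 00:00', periods=10, freq='min')
ticks = pd.Series(np.arange(10.0), index=idx)

first = ticks.resample('5Min').first()  # value at the start of each bar
mean = ticks.resample('5Min').mean()    # average over each bar

print(first.tolist())  # [0.0, 5.0]
print(mean.tolist())   # [2.0, 7.0]
```

Each five-minute bucket collects five of the original observations; first() keeps the opening one, mean() averages all of them.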
In this way we could get the closing price for the day by resampling by day and requesting the last value:

> ticks.resample('D').last()

             close     volume
time_start
2016-03-31  416.02   0.200000
2016-04-01  417.90  52.099684
2016-04-02  420.30   0.850000
...

Now that we know how to resample the data, we can consider creating a new open, high, low and close set of prices for the resampled data. The ohlc() method lets us find the OHLC prices for our newly sampled data. Let us do this for the five-minute bars:

> bars = ticks['close'].resample('5Min').ohlc()

                       open    high     low   close
time_start
2016-03-31 00:00:00  413.27  413.27  413.27  413.27
2016-03-31 00:05:00  413.26  413.28  413.25  413.28
2016-03-31 00:10:00  413.51  414.98  413.51  414.98

Pandas will take the first and last values in the interval to be the open and close for the bar. Then it will take the max and min as the high and low, respectively. In this way, we can start filtering the data. For example, imagine we are interested in the prices between 10 am and 4 pm each day. Notice the use of between_time to filter the data:

> filtered = bars.between_time('10:00', '16:00')

                       open    high     low   close
time_start
2016-03-31 10:00:00  416.00  416.00  415.98  415.98
2016-03-31 10:05:00  415.98  415.98  415.97  415.97
...
2016-04-03 15:55:00  421.01  421.02  421.00  421.00
2016-04-03 16:00:00  421.01  421.01  421.01  421.01

We may be interested in looking at the price first thing in the morning, say 8 am. In this case we use the at_time method:

> bars.open.at_time('8:00')

time_start
2016-03-31 08:00:00    416.11
2016-04-01 08:00:00    416.02
2016-04-02 08:00:00    420.69
2016-04-03 08:00:00    418.78
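The description of ohlc() above (first and last for open and close, max and min for high and low) can be confirmed on synthetic data; a quick sketch with made-up prices:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2016-03-31 00:00', periods=10, freq='min')
close = pd.Series([413.27, 413.50, 413.10, 413.80, 413.40,
                   414.00, 414.90, 413.90, 415.00, 414.50], index=idx)

# OHLC bars in one call...
bars = close.resample('5Min').ohlc()

# ...are the same as aggregating first/max/min/last by hand
agg = close.resample('5Min').agg(['first', 'max', 'min', 'last'])
print((bars.values == agg.values).all())  # True
```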
Not only that, we can request the percentage change too; the methods we have already discussed can be easily combined:

> bars.open.at_time('8:00').pct_change()

time_start
2016-03-31 08:00:00         NaN
2016-04-01 08:00:00    0.000216
2016-04-02 08:00:00    0.011225
2016-04-03 08:00:00    0.004540

Please note that the first percentage change cannot be calculated, as we do not have a comparison data point from the previous interval. In this case, Pandas indicates this by the use of NaN.

If we inspect the data in a bit more detail, we will see that for the last part of April 3, the frequency is such that we have some missing bars when sampling at five-minute intervals. In many cases we may find that we have some missing data in our datasets:

> bars.tail()

                      open   high    low  close
time_start
2016-04-03 23:35:00  420.6  420.6  420.6  420.6
2016-04-03 23:40:00    NaN    NaN    NaN    NaN
2016-04-03 23:45:00    NaN    NaN    NaN    NaN
2016-04-03 23:50:00  420.6  420.6  420.6  420.6
2016-04-03 23:55:00  421.0  421.0  420.6  420.6

We can fill in missing data with the help of fillna, which takes a parameter called method. It can be either 'pad' or 'ffill' to propagate the last valid observation forward; or instead either 'backfill' or 'bfill' to use the next valid observation to fill the gap. We can also limit the number of consecutive values that should be filled in with limit. For instance, we can fill only one gap by propagating the last valid value forward, limiting the operation to one time period:

> bars.fillna(method='ffill', limit=1)

...
2016-04-03 23:35:00  420.60  420.60  420.60  420.60
2016-04-03 23:40:00  420.60  420.60  420.60  420.60
2016-04-03 23:45:00     NaN     NaN     NaN     NaN
2016-04-03 23:50:00  420.60  420.60  420.60  420.60
2016-04-03 23:55:00  421.00  421.00  420.60  420.60
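The effect of limit is easiest to see on a tiny series with two consecutive gaps; a sketch using the equivalent ffill() shortcut:

```python
import numpy as np
import pandas as pd

s = pd.Series([420.6, np.nan, np.nan, 421.0])

# Forward-fill at most one consecutive gap
# (ffill() is equivalent to fillna with method='ffill')
once = s.ffill(limit=1)
print(once.isna().sum())  # 1: the second gap is left alone

# Without a limit, both gaps are filled from the last valid value
full = s.ffill()
print(full.tolist())  # [420.6, 420.6, 420.6, 421.0]
```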
Let us fill both gaps and create a new dataframe:

filledbars = bars.fillna(method='ffill')

For the volume, it would make sense to consider the sum of all the securities traded in the five-minute interval:

volume = ticks.volume.resample('5Min').sum()
vol = volume.fillna(0.)

A plot of the open, high, low and close prices for the five-minute bars, together with the corresponding volume for the 3rd of April between 9 am and 11.59 pm, is shown in Figure 1.5 and can be created as follows. The plotting commands that we know and love are available to Pandas series and dataframes too:

filledbars['2016-04-03'].between_time('9:00',\
    '23:59').plot(\
    color=['gray', 'gray', 'gray', 'k'],
    style=['-', '-', '.', '+'])

vol['2016-04-03'].between_time('9:30', '23:59')\
    .plot(secondary_y=True, style='ko')

Figure 1.5: Open, high, low and close prices for the exchange rate of bitcoin/USD.

1.4 Modelling Time Series Data

We know that there is no such thing as a perfect model, just good enough ones. With that in mind, we can start thinking about the assumptions we can make about the data in a time series. We would like to start with a simple model, and perhaps one of the first assumptions we can make is that there is no structure in the time series. In other words, we have a situation where each and every observation is in effect an independent random variate.

A good example of this would be white noise, a signal whose intensity is the same at all frequencies within a given band. When facing this type of signal, the best we can do is simply predict the mean value of the dataset.
Let us create some white noise in Python with the help of numpy:

import numpy as np
import pandas as pd

white = 2*np.random.random(size=2048) - 1
white = pd.Series(white)

In the code above, we are using the random method in numpy.random to draw samples from a uniform distribution. We would like our samples to be drawn from Unif[a, b) with a = -1 and b = 1 so that we have white noise with mean zero; hence the use of (b - a)(sample) + a. A plot of one such time series is shown in Figure 1.6.

Figure 1.6: White noise with zero mean, constant variance, and zero correlation.

Remember that we are assuming that each observation is independent from the others; we are keeping it simple. If there is correlation among the values of a given variable, we say that the variable is autocorrelated. For a repeatable (random) process X, let X_t be the realisation of the process at time t; also let the process have mean μ_t and variance σ_t². The autocorrelation R(s, t) between times t and s is given by:

R(s, t) = E[(X_t - μ_t)(X_s - μ_s)] / (σ_t σ_s),    (1.3)

where E[·] is the expectation value. Autocorrelation provides us with a measure of the degree of similarity between the values of a time series and a lagged or shifted version of that same series. Notice that we can recover the usual correlation definition for the case where X_t and X_s are two random variables not drawn from the same process at lagged times. Therefore, as with correlation, the values returned by an autocorrelation calculation lie between -1 and 1.

It is also important to mention that autocorrelation gives us information about the existence of linear relationships. Even when the autocorrelation measure is close to zero, there may be a nonlinear relationship between the values of a variable and a lagged version of itself.
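For a stationary series, where the mean and variance are the same at all times, Equation (1.3) can be estimated directly from the sample; a rough sketch (using the overall sample mean and variance as estimates of μ and σ²):

```python
import numpy as np

rng = np.random.RandomState(0)
x = 2 * rng.random(2048) - 1   # white noise in [-1, 1)

def autocorr(x, lag):
    # Sample version of R(s, t) = E[(X_t - mu)(X_s - mu)] / sigma^2
    mu, var = x.mean(), x.var()
    return np.mean((x[lag:] - mu) * (x[:-lag] - mu)) / var

# For white noise, every lag should give a value near zero
print(all(abs(autocorr(x, k)) < 0.1 for k in range(1, 5)))  # True
```

This is essentially what the Pandas autocorr method computes for us in the next section, up to small differences in how the overlapping means are estimated.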
Let us calculate the autocorrelation, at a few different lags, for our generated white noise. Autocorrelation can be calculated with autocorr:

> for lag in range(1, 5):
      print("Autocorrelation at lag={0} is {1}".\
          format(lag, white.autocorr(lag)))

Autocorrelation at lag=1 is 0.027756062237309434
Autocorrelation at lag=2 is 0.017698046805029784
Autocorrelation at lag=3 is 0.016764938190346888
Autocorrelation at lag=4 is 0.03636909301996918

The values returned by autocorr are the same as those we would obtain if we calculated the correlation of the time series with a shifted version of itself. Take a look:

> print(white.corr(white.shift(1)))
0.027756062237309434

As we can see, the result is the same. Here shift(n) translates the series by n periods, in this case 1, enabling us to calculate the autocorrelation value. Finally, predicting (or calculating) the mean value can be readily done as follows:

> print(white.mean())
0.019678911755368275

1.4.1 Regression... (Not) a Good Idea?

We have seen how to deal with processes that have no inherent structure, and hence the predictions we can make are quite straightforward (and boring ones, for that matter). Let us take a step forward and consider more interesting processes. If we were to compare the time series for the closing prices of the Apple stock shown in Figure 1.3 with the white noise we generated for Figure 1.6, we can clearly see that there is indeed more structure in the price data: There are peaks and troughs and we can even notice an upward trend.

Figure 1.7: Closing prices for Apple Inc. for a year since April 2017 and a trend line provided by a multivariate regression.

We are familiar with some techniques such as multivariate regression, and it may be conceivable to apply these techniques to the data we have.
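A trend line like the one in Figure 1.7 can be obtained with an ordinary polynomial fit against the time axis; a minimal sketch with np.polyfit on a synthetic upward-trending series (the degree, coefficients and data are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
t = np.arange(250, dtype=float)          # trading days as a numeric axis
price = 150 + 0.1 * t + rng.randn(250)   # synthetic trend plus noise

# Fit a low-degree polynomial to the time index
coeffs = np.polyfit(t, price, deg=2)
trend = np.polyval(coeffs, t)

# The fitted curve tracks the overall drift of the series
print(np.corrcoef(trend, price)[0, 1] > 0.9)  # True
```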
At the very least, regression may provide us with an idea of the trend in the time series. Ignoring seasonal variation and random noise, we can fit a polynomial model to the data as shown in Figure 1.7. We can see the general trend in the set. But is this really a suitable model?

It is hard to believe that the closing price of the Apple stock is simply a function of the calendar date! It is more likely that the prices are a function of their own history (as well as market forces, product announcements, etc.), and therefore we require methods that are able to capture precisely this assumed dependency and, given the results, decide whether the model is fit for purpose. We will tackle some models to achieve this in the rest of this chapter.

1.4.2 Moving Averages and Exponential Smoothing

We are interested in finding a model that is able to forecast the next value in our time series data. In the previous section we have seen how we can make some assumptions about the data we have and use that to our advantage. In the example with the Apple Inc. prices, we have been able to fit a regression model to the data, but surely we can do better than that.

What if we are able to forecast the future value based on the past values of the time series? For example, we may be able to take the average of the last n observations as the forecast for the next time period. This methodology is known as moving averages (or rolling averages); the forecast is provided by the simple mean over a period of time. For example, in the case where n = 3, the smoothed value at time t, s_t, will be given by:

s_t = (x_{t-2} + x_{t-1} + x_t) / 3.    (1.4)

We can also consider giving greater importance to more recent past values than to older ones. It sounds plausible, right?
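The three-term average in Equation (1.4) is exactly what the rolling method computes with a window of 3; a quick check on a toy series:

```python
import pandas as pd

x = pd.Series([10.0, 12.0, 11.0, 13.0, 14.0])

# Moving average over the last three observations
ma3 = x.rolling(window=3).mean()

# Equation (1.4) by hand for t = 2: (10 + 12 + 11) / 3
print(ma3[2])  # 11.0
```

The first two entries are NaN: a three-observation window cannot be formed before t = 2.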
Well, this is actually what exponential smoothing enables us to do: It works by weighting past observations. The weighting is performed via constant values called smoothing constants. The simplest method is appropriately called simple exponential smoothing (SES) and it uses one smoothing constant, α. In SES, we start by setting s_0 to x_0, and subsequent periods at time t are given by:

s_t = αx_t + (1 - α)s_{t-1},    (1.5)

with 0 ≤ α ≤ 1. The smoothing is a function of α; we have quick smoothing when α is close to 1, and slow smoothing when it is close to 0. We choose the value of α such that the mean of the squared errors (MSE) is minimised.

We can calculate moving averages and exponential smoothing on a time series with Pandas. For moving averages, we simply use the rolling method of Pandas dataframes. In the case of the Apple Inc. closing prices we have been investigating, we can write the following:

appl['MA3'] = appl['Close'].rolling(window=3).mean()

where we have provided the size of the moving window and indicated that the aggregation of the data will be the mean of the values. For exponential smoothing, Pandas provides the ewm method (EWM stands for Exponential Weighted Methods). We simply pass the parameter α as follows:

alpha = 0.6
appl['EWMA'] = appl['Close'].ewm(alpha=alpha).mean()

Figure 1.8: Moving averages (upper panel) and exponential smoothing (lower panel) applied to the closing prices for Apple Inc.

The method also accepts other definitions such as the centre of mass, the span or the half-life. In Table 1.4 we list the relationship between α and these alternative parameters.
EWM parameter          Definition
--------------------------------------------------------------------
Centre of mass (com)   α = 1/(1 + com),                 for com ≥ 0
Span                   α = 2/(1 + span),                for span ≥ 1
Half-life              α = 1 - exp[log(0.5)/halflife],  for halflife > 0

Table 1.4: Parameters specifying the decay applied to an exponential smoothing calculation with ewm.

In Figure 1.8 we can see the result of using moving averages and exponential smoothing compared to the closing prices for Apple Inc.

1.4.3 Stationarity and Seasonality

We have been considering some of the assumptions we can make about our data in order to come up with models that enable us to understand the underlying phenomena and create predictions. One such common assumption is that our time series is stationary. In this context, we say that a process is stationary if its mean, variance and autocorrelation do not change over time.

As you can imagine, stationarity can be defined in precise mathematical terms, but a practical way of remembering what we are talking about is effectively a flat-looking series: one where there is no trend, the variance is constant over time, and there are no periodic fluctuations or seasonality.

Before we continue our discussion about stationarity, let us take a look at seasonality. This can be understood as a cycle that repeats over time, such as monthly or yearly (or any other time interval). This repeating cycle may interfere with the signal we intend to forecast, while at the same time it may provide some insights into what is happening in our data. Understanding the seasonality in our data can improve our modelling as it enables us to create a clearer signal.
In other words, if we are able to identify the seasonal component in our series, we may be able to extract it out, leaving us with a component which we understand (the seasonal part) plus a clearer relationship between the variables at hand. A time series with a clear seasonal component is said to be non-stationary. When we remove the seasonal component from a time series, we end up with a so-called seasonal stationary series.

There are many ways in which we can take a look at the seasonality in a time series. In this case, let us take a look at using the Fast Fourier Transform (FFT) to convert the time-dependent data into the frequency domain. This will enable us to analyse whether any predominant frequencies exist. In other words, we can check if there is any periodicity in the data. We will not cover the intricate details of the mathematics behind the FFT, but a recommended reading is the excellent Numerical Recipes[8] book.

[8] Press, W., S. Teukolsky, W. Vetterling, and B. Flannery (2007). Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press

Let us take a look at the sunspot data we plotted in Figure 1.2. In that figure we have monthly observations of sun activity. In the analysis below we will resample the data into yearly observations. The data can be found at https://doi.org/10.6084/m9.figshare.6728255.v1 as a comma-separated value file with the name "sunspots_month.csv"[9].

[9] Rogel-Salazar, J. (2018d, Jul). Sunspots - Monthly Activity since 1749. https://doi.org/10.6084/m9.figshare.6728255.v1

After loading the usual modules such as Pandas, we can read the data as follows. While loading the data, we specify the format for reading the dates and indicate which column is the index in our dataset:

sun = pd.read_csv('sunspots_month.csv')
sun.Year = pd.to_datetime(sun.Year, format='%Y-%m-%d')
sun.set_index('Year', inplace=True)

As we mentioned before, we have monthly data and we would like to take a yearly view. The first thing we are going to do is obtain a yearly average:
The first thing we are going to do is obtain a yearly average: While loading the data, we can specify the format for reading the date. advanced data science and analytics with python sun_year = sun.resample(’Y’).mean() 41 We are resampling the data to a yearly frequency. Let us now load the FFT pack from scipy: from scipy import fftpack Fast Fourier transform capabilities are part of fftpack in scipy. Given the signal of the yearly sunspot activity we can calculate its Fourier transform. We also calculate a normalisation constant n: Y=fftpack.fft(sun_year[’Value’]) We calculate the FFT of the signal n=int(len(Y)/2) and a normalisation constant. With this information we can create an array to hold the frequencies in the signal, with the period being the inverse frequency: freq=np.array(range(n))/(2*n) period=1./freq With this information we can obtain the period. We can now calculate the power spectrum of the signal as follows: power=abs(Y[1:n])**2 And finally the power spectrum of the signal. A plot of the power spectrum versus the period is shown in Figure 1.9 where we can see that the sunspot activity data is periodic, and that the sunspots occur with a maximum in activity approximately every 11 years. Cool! 42 j. rogelsalazar 1e7 1.4 1.2 FFT2 1.0 0.8 0.6 0.4 0.2 0.0 0 1.4.4 5 10 15 Period (Year) 20 Determining Stationarity 25 30 Figure 1.9: Analysis of the power spectrum of the sunspots data. We can see that a maximum in activity occurs approximately every 11 years. We have seen that there is seasonality in our sunspot data and so, it is a nonstationary time series. In other cases we may need to check that the mean and variance are constant and the autocorrelation is timeindependent. We can do some of these checks by plotting rolling statistics Rolling statistics can help us to see if the moving average and/or moving variance vary determine stationarity. with time. 
Another method is the Dickey-Fuller test, which is a statistical test for checking stationarity. In this case the null hypothesis is that the time series is non-stationary; the test provides a test statistic and critical values at different confidence levels. If the test statistic is below the critical value, we can reject the null hypothesis and say that the series is stationary.

Figure 1.10: Sunspot activity and rolling statistics for the average and the standard deviation.

Let us see what this means for our monthly sunspot activity data. We can calculate rolling statistics for the mean and variance, here with a window of 2 years:

rolling_mean = sun_year['Value'].rolling(2).mean()
rolling_std = sun_year['Value'].rolling(2).std()

In Figure 1.10 we can see the rolling statistics for the sunspot activity. The variation in the average is larger than that in the standard deviation, but neither seems to be increasing or decreasing with time.

Let us take a look at the Dickey-Fuller test. In this case we are going to use the adfuller method in the statsmodels time series analysis module tsa.stattools:

from statsmodels.tsa.stattools import adfuller
df_test = adfuller(sun_year['Value'], autolag='AIC')

The Dickey-Fuller test implementation returns several items, including the test statistic, the p-value, the number of lags used and the critical values. We can take a look at the results with the following function:

def isstationary(df_test):
    stationary = []
    print('Test Statistic is {0}'.format(df_test[0]))
    print('p-value is {0}'.format(df_test[1]))
    print('No. lags used = {0}'.format(df_test[2]))
    print('No. observations used = {0}'.\
        format(df_test[3]))
    for key, value in df_test[4].items():
        print('Critical Value ({0}) = {1}'.\
            format(key, value))
        if df_test[0] <= value:
            stationary.append(True)
        else:
            stationary.append(False)
    return all(stationary)

Let us look at the results:

> isstationary(df_test)

Test Statistic is -2.4708472868362916
p-value is 0.12272956184228762
No. lags used = 8
No. observations used = 256
Critical Value (1%) = -3.4561550092339512
Critical Value (10%) = -2.5728222369384763
Critical Value (5%) = -2.8728972266578676
False

In this case we see that the Dickey-Fuller test applied to the sunspots data supports the null hypothesis: We cannot reject it, and the yearly data for the sunspot activity is therefore non-stationary.

Extracting the trend and seasonality out of the time series data provides us with better ways to understand the process at hand. A useful technique for this is to decompose the time series into those components that are amenable to being described by a model; systematic components are those that can be modelled. Given a time series Y_t, a naïve additive model decomposes the signal as follows:

Y_t = T_t + S_t + e_t,    (1.6)

where T_t is the trend, S_t is the seasonality and e_t corresponds to the residuals or random variation in the series. An alternative to this decomposition is the so-called multiplicative model:

Y_t = (T_t)(S_t)(e_t).    (1.7)

We can use seasonal_decompose from statsmodels to decompose the signal using moving averages.
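The additive split of Equation (1.6) can also be sketched by hand with rolling means, without statsmodels; here on a synthetic series with a known period of 12 (all numbers are illustrative):

```python
import numpy as np
import pandas as pd

period = 12
t = np.arange(120)
rng = np.random.RandomState(2)

# Known trend + seasonal cycle + noise
series = pd.Series(0.05 * t
                   + np.sin(2 * np.pi * t / period)
                   + 0.1 * rng.randn(120))

# Trend: centred moving average over one full period
trend = series.rolling(period, center=True).mean()

# Seasonality: average detrended value at each position in the cycle
detrended = series - trend
seasonal = detrended.groupby(t % period).transform('mean')

# Residual: whatever is left over
resid = series - trend - seasonal

# The three components add back to the original series
check = (trend + seasonal + resid - series).abs().max()
print(check < 1e-9)  # True
```

This is only a rough version of what seasonal_decompose does internally, but it makes the roles of T_t, S_t and e_t concrete.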
Since we have seen that we have a seasonality of around 11 years, we will use this information to decompose our time series. seasonal_decompose lets us split a time series into its systematic and non-systematic components:

import statsmodels.api as sm

dec_sunspots = sm.tsa.seasonal_decompose(sun_year,\
    model='additive', freq=11)

We can use a multiplicative method by passing the parameter model='multiplicative'. The result of the decomposition is an object that has a plotting method. We can look at the result of the decomposition by typing dec_sunspots.plot(), and the output can be seen in Figure 1.11, where we have plots for the trend, seasonality and the residuals.

Figure 1.11: Trend, seasonality and residual components for the sunspot dataset.

Let us apply the Dickey-Fuller test to the bitcoin tick data defined on page 26. We resample the data on 15-minute intervals and take the average:

closing_bitcoin = ticks['close'].\
    resample('15Min').mean()

df_test_bitcoin = adfuller(closing_bitcoin,\
    autolag='AIC')

> isstationary(df_test_bitcoin)

Test Statistic is -1.4531293932585607
p-value is 0.5565571771135377
No. lags used = 10
No. observations used = 373
Critical Value (1%) = -3.448003816652923
Critical Value (5%) = -2.86931999731073
Critical Value (10%) = -2.5709145866785503
False

The bitcoin data is therefore non-stationary too. We can see its decomposition in Figure 1.12.

Figure 1.12: Trend, seasonality and residual components for the bitcoin dataset.

1.4.5 Autoregression to the Rescue

So far, we have been doing O.K. with the time series we have seen. However, we know that simply using a linear or polynomial fit to the data is not good enough.
Furthermore, we cannot ignore the seasonal variation and the random noise that make up the signal. When we discussed the idea of moving averages, we considered that a better approach was to see if the next value in the series can be predicted as some function of its previous values. A way to achieve this is autoregression, which is exactly what it sounds like: a regression of the dataset on itself. We are therefore interested in building a regression model of the current value fitted on one (or more) previous values, called lagged values.

This sounds great, but how many lagged values do we need? Well, we can take a look at the time series and check how much information there is in the previous values to help us with our prediction. We can do this with the help of the autocorrelation function (ACF) we defined in Equation (1.3). Similarly, we can look at the partial autocorrelation function (PACF), which, unlike the autocorrelation, controls for the values of the time series at all shorter lags.

The correlation function will test whether adjacent observations are autocorrelated; in other words, it will help us determine if there are correlations between observations 1 and 2, 2 and 3, ..., n-1 and n. This is known as "lag-one autocorrelation". Similarly, it will test at other lags: For instance, the autocorrelation at lag 4 tests whether observations 1 and 5, 2 and 6, ... are correlated. In general, we should test for autocorrelation at lags 1 to n/4, where n is the total number of observations in the analysis. Estimates at longer lags have been shown to be statistically unreliable[10].

[10] Box, G. and G. Jenkins (1976). Time Series Analysis: Forecasting and Control. Holden-Day series in time series analysis and digital processing. Holden-Day

We can take a look at the autocorrelation and partial autocorrelation for the sunspot dataset with the following code:

sm.graphics.tsa.plot_acf(sun_year, lags=40)
sm.graphics.tsa.plot_pacf(sun_year, lags=40)

Figure 1.13 shows the result of the code above. We can see in the upper panel that the autocorrelation shows a periodic structure, reflecting the seasonality in the time series.

Figure 1.13: Autocorrelation and partial autocorrelation for the sunspot dataset.

A similar computation can be carried out for the bitcoin dataset. The result can be seen in Figure 1.14. As we can see in the upper panel, the correlation fades slowly as we take longer and longer lagged values.

Figure 1.14: Autocorrelation and partial autocorrelation for the bitcoin dataset.

It stands to reason that if value 0 is correlated with value 1, and value 1 is correlated with value 2, it follows that 0 must be correlated with 2. This is why we need the partial autocorrelation, as it provides us with information about the relationship between an observation and observations at prior time steps, but with the crucial difference that the intervening observations are removed.

In the examples of the sunspot and bitcoin datasets, we can see from the lower panels of the correlograms in Figures 1.13 and 1.14 (a correlogram is a plot showing the correlation statistics) that only the most recent values are really useful in building an autoregression model. A PACF correlogram with a large spike at one lag that decreases after a few lags usually indicates that there is a moving average term in the series. In this case, the autocorrelation function will help us determine the order of the moving average term.
If instead we have a large spike at one lag followed by a damped oscillating correlogram, then we have a higher-order moving average term. This is the picture we get from the lower panel of Figure 1.13 for the sunspot data. In the case of the correlogram shown in the lower panel of Figure 1.14, we have a few important correlations in the first few lags that die out quite quickly. A spike at the first lags followed by not very important ones suggests the presence of an autoregressive term, and we can determine the order of this autoregressive term from the spikes in the correlogram. We will discuss autoregressive models in the following section.

1.5 Autoregressive Models

An autoregressive (AR) model is a representation of a type of random process where the future values of the series are based on weighted combinations of past values. As such, an AR(1) is a first-order process in which the current value is based only on the immediately previous value:

Yt = β0 + β1 Yt−1 + et.    (1.8)

An AR(2) process determines the current value based on the previous two values:

Yt = β0 + β1 Yt−1 + β2 Yt−2 + et,    (1.9)

and so on. It is possible to use autoregression and moving averages in combination to describe various time series. This methodology is usually called autoregressive moving average (ARMA) modelling. In ARMA modelling we use two expressions to describe the time series, one for the moving average and the other for the autoregression. ARMA(p, q) denotes a model with autoregression of order p and moving average of order q. A further generalisation of an ARMA model is the so-called autoregressive integrated moving average, or ARIMA, model.
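To make Equations (1.8) and (1.9) concrete, here is a small sketch (not from the book; the coefficient values are made up for illustration) that simulates an AR(2) process and recovers β0, β1 and β2 by regressing the current value on its two lagged values with ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
b0, b1, b2 = 2.0, 0.6, 0.3   # assumed "true" AR(2) coefficients

# Simulate Y_t = b0 + b1*Y_{t-1} + b2*Y_{t-2} + e_t  (Equation 1.9)
y = np.zeros(5000)
for t in range(2, y.size):
    y[t] = b0 + b1 * y[t - 1] + b2 * y[t - 2] + rng.normal(0, 1)

# Regress the current value on a constant and the two lagged values:
# columns are [1, Y_{t-1}, Y_{t-2}] for t = 2 .. n-1.
X = np.column_stack([np.ones(y.size - 2), y[1:-1], y[:-2]])
coefs, *_ = np.linalg.lstsq(X, y[2:], rcond=None)
print(coefs)  # approximately [2.0, 0.6, 0.3]
```

With enough observations the least-squares estimates land close to the coefficients used in the simulation, which is exactly the sense in which an AR model is "a regression on the dataset itself".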
The AR and MA parts of the acronym follow the discussion above. The integrated (or "I") part is perhaps less clear, but effectively it means that the time series has been rendered stationary by taking differences. In other words, instead of looking at the observation Y1 we are interested in Y1 − Y0. An ARIMA(p, d, q) model puts together all the techniques we have discussed in this chapter and is specified by three parameters, p, d and q, where:

• p: Denotes the order of the autoregression
• d: Denotes the number of difference levels
• q: Denotes the order of the moving average

We have some commonly used models, such as:

• ARIMA(0, 0, 0) simply predicts the mean of the overall time series. In other words, there is no structure!
• ARIMA(0, 1, 0) works out the differences (not the raw values) and predicts the next one without autoregression or smoothing. This is effectively a random walk!

Let us take a look at applying ARMA and ARIMA models to the sunspot dataset. For instance, we can apply an ARMA(9, 0) model as follows:

arma_sun = sm.tsa.ARMA(sun_year, (9, 0)).fit()
print(arma_sun.params)

const          50.466706
ar.L1.Value     1.161912
ar.L2.Value     0.387975
ar.L3.Value     0.179743
ar.L4.Value     0.148018
ar.L5.Value     0.098705
ar.L6.Value     0.036090
ar.L7.Value     0.014294
ar.L8.Value     0.055000
ar.L9.Value     0.226996

The best model can be found by changing the parameters p and q of the model such that we minimise any of the various information criteria, such as the Akaike (AIC), the Bayesian (BIC) or the Hannan-Quinn (HQIC) information criterion. See Appendix A for more details about these information criteria.
print("AIC: ", arma_sun.aic)
print("BIC: ", arma_sun.bic)
print("HQIC:", arma_sun.hqic)

AIC:  2230.4154805952835
BIC:  2269.792508681132
HQIC: 2246.236568447941

We can also apply an ARIMA model, in this case an ARIMA(9, 1, 0):

arima_mod = ARIMA(sun_year, order=(9, 1, 0)).fit()
print(arima_mod.summary())

An abridged version of the summary provided by the ARIMA(9, 1, 0) model applied to the sunspot dataset:

ARIMA Model Results
=================================================
Dep. Variable:        D.Value
Model:                ARIMA(9, 1, 0)
Method:               css-mle
No. Observations:     264
Log Likelihood:       -1103.368
S.D. of innovations:  15.716
...
AIC:                  2228.736
BIC:                  2268.072
HQIC:                 2244.542
=================================================

Finally, it is important to note that in the ideal scenario we would carry out the analysis on a training dataset to develop a predictive model to be tested against a testing set. Nonetheless, let us take a look at the predictions we could draw, in this case for the ARMA model above. We can run predictions from the models with the predict method for each of them:

predict_sunspots = arma_sun.predict('1980',\
    '2050', dynamic=True)

The result can be seen in Figure 1.15, where we can compare the actual values of the sunspot activity against the predictions made by the model for the years between 1980 and 2020. Not bad for a model that has not been curated!

Figure 1.15: Prediction for the sunspot activity using an ARMA(9, 0) model. The plot shows the number of sunspots per year between 1970 and 2020.

1.6 Summary

In this chapter we addressed some important aspects of dealing with time series data, and no Jackalope data scientist should be without this knowledge. We have seen that time series are different from other datasets due to the time component. We saw some relevant examples such as the prices of the Apple Ltd.
stock, sunspot activity since the mid-1700s and even the exchange rate of bitcoins to US dollars. We were able to deal with these various datasets thanks to Python modules such as Pandas and statsmodels. We saw how Pandas enables us to index our dataframes with time, and looked at appropriate transformations that Pandas enables us to carry out, such as resampling, slicing and dicing, filtering, aggregating and plotting.

In terms of modelling time series, we covered how moving averages and exponential smoothing let us take a first approach at forecasting future values of the series based on previous observations. We discussed the concepts of seasonality and stationarity in a time series. We applied decomposition to our datasets, and finally we discussed how autoregression can be used to model time series, combining the topics discussed in this chapter.

2 Speaking Naturally: Text and Natural Language Processing

There are many kinds of language: We speak with our body language, we need to "mind our language" in certain situations, we learn a foreign language to ask for a pain au chocolat or una cerveza, and we need language to understand a French lecture on Sheep-Aircraft, presented by le célèbre Jean-Brian Zatapathique, of course (Baaaa, baaaa). Indeed, we are also using the Python programming language to create analytics workflows and train machine learning models. We speak the "language of love", and avoid being confusing by speaking in "plain language". What about natural language? Have you heard of it? What is it, and when do we use it?

Let us take a step back: The common theme among the expressions we listed above is communication. In other words, the different expressions listed use the word language to emphasise the fact that we communicate with other humans in a variety of ways. Natural language is one of those forms of communication.
The term refers to the use of any language that has evolved naturally in humans through continued use, repetition and adaptation. English, Spanish, Japanese and Nahuatl are some examples of natural languages. In contrast, languages like Python, C++, Scala or Java, as well as Esperanto, Klingon, Elvish or Dothraki, are constructed languages. As you can imagine, natural language can take different forms such as speech, writing or even singing. In any case, communicating in a natural (or constructed) language is a useful, if complex, task. You may not notice it all the time, but imagine interviewing a man who speaks entirely in anagrams. The Be ot o