Python data analysis has become a crucial tool for various industries due to its simplicity flexibility and extensive libraries it enables efficient data manipulation and analysis facilitates collaboration and knowledge sharing among researchers and provides actionable insights that inform business decisions or improve patient outcomes the use of python data analysis has led to significant improvements in efficiency and accuracy across different sectors.
The importance of reproducibility cannot be overstated in data science projects it ensures that others can verify the results which is crucial for building trust and credibility within the research community reproducible code also facilitates collaboration and knowledge sharing among researchers effective model evaluation is another essential aspect of data science projects as it determines whether a particular model is useful for making predictions or decisions.
Data visualization plays a critical role in data science by helping identify patterns and trends that may not be immediately apparent it provides a basis for further questions allowing researchers to refine their analysis and gain deeper insights into the data the use of python libraries like pandas and numpy has become widespread due to their ability to handle large datasets and perform complex operations efficiently.
Introduction To Python Data Analysis
Python Data Analysis is a crucial skill for data scientists, researchers, and analysts to extract insights from complex datasets. The Python programming language has become the de facto standard for data analysis due to its simplicity, flexibility, and extensive libraries. The Pandas library, in particular, provides efficient data structures and operations for manipulating and analyzing large datasets.
Pandas is built on top of NumPy, which provides support for large, multi-dimensional arrays and matrices. This foundation enables Pandas to handle complex data structures, such as DataFrames, which are two-dimensional tables with labeled axes. The Pandas library offers various functions for data manipulation, including filtering, sorting, grouping, and merging datasets.
Data analysis in Python often involves working with real-world data from various sources, such as CSV files, Excel spreadsheets, or databases. The Pandas library provides a range of tools for reading and writing data to these formats, making it easy to integrate data from different sources into a single analysis pipeline. Additionally, the library’s integration with other popular Python libraries, such as Matplotlib and Seaborn, enables users to create high-quality visualizations and reports.
One of the key strengths of Pandas is its ability to handle missing data and perform data cleaning operations efficiently. The library provides various functions for handling missing values, including filling them with specific values or interpolating them based on surrounding data points. This feature is particularly useful when working with datasets that contain missing or inconsistent data.
The Python Data Analysis ecosystem has evolved significantly over the years, with new libraries and tools emerging to address specific needs and challenges. The Scikit-learn library, for example, provides a range of machine learning algorithms and tools for model selection, tuning, and evaluation. Similarly, the Statsmodels library offers statistical modeling capabilities, including time series analysis and hypothesis testing.
Pandas Library For Data Manipulation
The Pandas Library for Data Manipulation is a powerful tool in the Python ecosystem, designed to handle large datasets with ease. At its core, Pandas is built around two primary data structures: Series (one-dimensional labeled array-like) and DataFrame (two-dimensional labeled data structure with columns of potentially different types). These data structures are optimized for efficient manipulation and analysis of data.
One of the key features of Pandas is its ability to handle missing data. The library provides a range of methods for dealing with missing values, including dropping them entirely, filling them with specific values, or imputing them using statistical models. This flexibility makes it an essential tool for any data scientist working with real-world datasets.
Pandas also excels at data merging and joining. With its built-in merge function, users can easily combine multiple DataFrames based on common columns, taking advantage of the library’s optimized performance. Additionally, Pandas provides a range of join types, including inner, left, right, and outer joins, allowing for precise control over how datasets are combined.
When it comes to data manipulation, Pandas offers an extensive array of functions for filtering, sorting, grouping, and pivoting data. The library’s groupby function, in particular, is a powerful tool for performing aggregation operations on large datasets. By leveraging the groupby function, users can efficiently perform complex analyses that would be difficult or impossible with other libraries.
Pandas also integrates seamlessly with other popular Python data science libraries, such as NumPy and Matplotlib. This integration enables users to leverage the strengths of each library in a single workflow, making it an ideal choice for any data analysis task.
Numpy Library For Numerical Computing
NumPy, short for Numerical Python, is a library for working with arrays and mathematical operations in Python. It provides support for large, multi-dimensional arrays and matrices, along with a wide range of high-performance mathematical functions to manipulate them (Van Rossum & Drake, 2009). NumPy’s primary data structure is the array, which can be thought of as a container that holds multiple values of the same type.
NumPy arrays are similar to Python lists but offer several advantages. They provide a more efficient and flexible way to store and manipulate numerical data (Oliphant, 2006). One key feature of NumPy arrays is their ability to handle large datasets with ease, making them ideal for scientific computing applications such as data analysis, signal processing, and machine learning.
NumPy’s functionality extends beyond basic array operations. It includes support for various mathematical functions, including linear algebra operations like matrix multiplication and eigenvalue decomposition (Gommers et al., 2019). Additionally, NumPy provides tools for working with complex numbers, polynomials, and other mathematical constructs.
The library also offers a range of utility functions for tasks such as random number generation, data manipulation, and file input/output operations. These features make NumPy an essential tool for any Python programmer working with numerical data (Harris et al., 2020).
NumPy’s compatibility with other popular scientific libraries in the Python ecosystem makes it a versatile choice for various applications. It integrates seamlessly with libraries like Pandas for data analysis, SciPy for scientific computing, and Matplotlib for visualization.
Data Visualization Techniques And Tools
Data visualization techniques play a crucial role in Python data analysis, allowing users to effectively communicate insights and trends within complex datasets. One popular tool for creating interactive visualizations is Plotly, which provides a wide range of customizable charts and graphs. According to the Plotly documentation, the library supports over 40 different types of visualizations, including line plots, scatter plots, and bar charts (Plotly, n.d.).
The Matplotlib library is another essential tool in Python data analysis, offering a comprehensive set of visualization tools for creating high-quality static and interactive plots. As noted by Hunter et al. , Matplotlib’s flexibility and customizability make it an ideal choice for creating complex visualizations that require precise control over layout and appearance.
In addition to these libraries, the Seaborn library provides a high-level interface for creating informative and attractive statistical graphics. According to Waskom , Seaborn’s focus on aesthetics and clarity makes it particularly well-suited for creating visualizations that are both beautiful and effective at communicating insights.
When selecting a data visualization tool, users should consider the specific requirements of their project, including the type of data being analyzed, the desired level of interactivity, and the need for customization. As noted by Tufte , effective data visualization requires careful consideration of these factors in order to create visualizations that are both informative and engaging.
The use of data visualization tools can have a significant impact on the effectiveness of Python data analysis projects, allowing users to communicate complex insights and trends in a clear and concise manner. By selecting the right tool for the job, users can create high-quality visualizations that inform decision-making and drive business outcomes.
Statistical Computing Methods And Algorithms
Statistical computing methods and algorithms play a crucial role in data analysis, particularly in the context of Python data analysis. The NumPy library is a fundamental component of Python’s scientific computing ecosystem, providing support for large, multi-dimensional arrays and matrices, along with a wide range of high-performance mathematical functions to manipulate them (Van der Walt et al., 2011). This enables efficient numerical computations, which are essential in statistical modeling and data analysis.
The Pandas library is another key tool in Python data analysis, offering data structures and functions designed to efficiently handle structured data, including tabular data such as spreadsheets and SQL tables. The DataFrame object, a two-dimensional table of data with columns of potentially different types, is particularly useful for storing and manipulating datasets (Reback et al., 2017). This facilitates tasks such as filtering, sorting, and grouping data, which are common in statistical analysis.
Statistical modeling often involves the use of machine learning algorithms to identify patterns or relationships within a dataset. Scikit-learn is a widely used Python library that provides an extensive range of algorithms for classification, regression, clustering, and other tasks (Pedregosa et al., 2011). These algorithms can be applied to various types of data, including numerical and categorical variables, making them versatile tools in statistical analysis.
In addition to these libraries, the SciPy library offers a comprehensive set of scientific functions for tasks such as signal processing, linear algebra, and optimization (Jones et al., 2001). This enables users to perform complex calculations and simulations that are often required in statistical modeling. The combination of these libraries provides a powerful toolkit for data analysis and statistical modeling.
The use of Python in data analysis is further enhanced by its integration with other tools and languages, such as R and SQL (McKinney et al., 2012). This allows users to leverage the strengths of each tool in their workflow, making it easier to perform complex analyses. The versatility and flexibility of Python make it an ideal choice for a wide range of data analysis tasks.
Working With Large Datasets In Python
Working with Large Datasets in Python requires efficient data manipulation and analysis techniques to extract meaningful insights. One such technique is the use of Pandas, a powerful library that provides data structures and functions for efficiently handling structured data.
Pandas DataFrames are two-dimensional tables of data with columns of potentially different types. They are similar to Excel spreadsheets or SQL tables, but with more advanced features. DataFrames can be created from various sources, including CSV files, databases, and other Pandas objects. The read_csv function is commonly used to load data from comma-separated value (CSV) files into a DataFrame.
When working with large datasets, it’s essential to consider the memory usage of your code. Pandas DataFrames can consume significant amounts of memory, especially when dealing with large datasets. To mitigate this issue, you can use techniques such as chunking, where the data is divided into smaller chunks that fit within available memory. This approach allows for efficient processing and analysis of large datasets.
Another crucial aspect of working with large datasets in Python is data cleaning and preprocessing. Pandas provides various functions for handling missing data, removing duplicates, and performing other essential tasks. The dropna function can be used to remove rows or columns containing missing values, while the duplicated method helps identify duplicate entries.
Data analysis often involves complex computations and aggregations. Pandas offers a range of aggregation methods, including mean, median, sum, and count, which can be applied to individual columns or entire DataFrames. The groupby function is particularly useful for performing group-level operations, such as calculating means or sums across different categories.
Handling Missing Values And Outliers
Missing values are a common issue in data analysis, particularly when working with real-world datasets that may contain incomplete or missing information. In Python, missing values are typically represented as NaN (Not a Number) or None. When dealing with missing values, it is essential to identify the type of missingness and decide on an appropriate strategy for handling them.
One approach is to impute missing values using statistical methods such as mean, median, or regression-based imputation. This involves replacing missing values with estimates based on the available data. For example, if a dataset contains missing values in a column representing age, one could use the mean age of the non-missing values to impute the missing values.
Another approach is to use machine learning algorithms that can handle missing values directly, such as decision trees or random forests. These algorithms can learn from the available data and make predictions even when some values are missing. However, this approach may not always be suitable, particularly if the missingness is systematic or related to a specific feature.
Outliers, on the other hand, are values that are significantly higher or lower than the rest of the dataset. In Python, outliers can be detected using statistical methods such as the interquartile range (IQR) or the Z-score. Once identified, outliers can be handled in various ways, including winsorization, where extreme values are replaced with a more moderate value.
When dealing with missing values and outliers, it is crucial to consider the type of analysis being performed and the potential impact on the results. For instance, if the goal is to train a machine learning model, imputing missing values or removing outliers may not be suitable, as this could introduce bias into the model.
In some cases, it may be necessary to use specialized libraries such as Pandas or Scikit-learn that provide built-in functions for handling missing values and outliers. These libraries can simplify the process of data cleaning and preparation, allowing analysts to focus on more complex tasks.
Data Cleaning And Preprocessing Techniques
Data cleaning is a crucial step in the data analysis process, involving the identification and correction of errors or inconsistencies in the data. This can include handling missing values, removing duplicates, and correcting formatting issues.
One common technique used for data cleaning is the use of pandas’ dropna() function to remove rows with missing values (McKinney, 2012). However, this approach may not be suitable for all datasets, as it can result in the loss of valuable information. An alternative method is to use the fillna() function to replace missing values with a specific value or mean/median/ mode of the column (Wes McKinney, 2020).
Another important aspect of data cleaning is data normalization, which involves scaling numeric variables to have similar magnitudes (Johnson & Wichern, 2007). This can be achieved using techniques such as min-max scaling or standardization. Normalizing data can improve the performance of machine learning models and reduce the impact of outliers.
Data preprocessing also involves handling categorical variables, which can be encoded using one-hot encoding or label encoding (Pedregal et al., 2018). One-hot encoding creates a new binary column for each category, while label encoding assigns a unique integer to each category. The choice of encoding method depends on the specific problem and dataset.
In addition to these techniques, data cleaning also involves handling outliers and anomalies in the data (Hawkins, 1980). This can be achieved using methods such as winsorization or trimming, which involve replacing extreme values with a more moderate value. Effective outlier detection and handling is crucial for ensuring the accuracy of machine learning models.
Data Transformation And Feature Engineering
Data Transformation in Python Data Analysis involves converting raw data into a suitable format for analysis, often involving the manipulation of existing data structures or the creation of new ones.
This process typically begins with importing relevant libraries such as Pandas, NumPy, and Matplotlib, which provide essential tools for data manipulation, numerical computations, and visualization. The Pandas library, in particular, offers powerful data structures like DataFrames that enable efficient storage and manipulation of large datasets.
Data transformation techniques often include handling missing values, encoding categorical variables, scaling or normalizing numeric features, and aggregating data to reduce dimensionality. These operations are crucial for ensuring the quality and consistency of the data, making it suitable for machine learning models or statistical analysis.
Feature Engineering is a critical aspect of Data Transformation, where new features are created from existing ones to improve model performance or gain deeper insights into the data. This can involve calculating summary statistics, creating interaction terms, or even generating new variables through techniques like Principal Component Analysis (PCA) or t-SNE.
The goal of Feature Engineering is to create a more informative and representative dataset that better captures the underlying patterns and relationships within the data. By doing so, analysts can develop more accurate models, identify hidden trends, or uncover novel insights that might have gone unnoticed otherwise.
Data Transformation and Feature Engineering are essential steps in preparing data for analysis, as they enable the creation of high-quality datasets that can be effectively utilized by machine learning algorithms or statistical models.
Machine Learning Applications In Python
Python’s scikit-learn library provides a wide range of machine learning algorithms, including supervised and unsupervised techniques, for classification, regression, clustering, and dimensionality reduction tasks (Pedregal et al., 2011). The library’s API is designed to be consistent and easy to use, making it a popular choice among data scientists and researchers.
One of the key features of scikit-learn is its ability to handle high-dimensional data, which is common in many real-world applications. The library provides various techniques for dimensionality reduction, such as PCA (Principal Component Analysis) and t-SNE (t-distributed Stochastic Neighbor Embedding), to reduce the number of features while preserving the most important information (Van der Maaten & Hinton, 2008). These techniques are particularly useful when dealing with large datasets that require significant computational resources.
Scikit-learn also provides a range of clustering algorithms, including K-Means and Hierarchical Clustering, for identifying patterns in unlabeled data. These algorithms can be used to segment customers based on their behavior, identify clusters of similar products, or group patients with similar medical conditions (Hastie et al., 2009). The library’s implementation of these algorithms is highly optimized and efficient, making it suitable for large-scale applications.
In addition to its machine learning capabilities, scikit-learn also provides a range of tools for data preprocessing and feature engineering. These include techniques such as normalization, standardization, and feature scaling, which are essential for ensuring that the input data is in a suitable format for training machine learning models (Bishop, 2006). The library’s API makes it easy to combine these techniques with machine learning algorithms to create robust and accurate models.
Scikit-learn’s popularity among researchers and practitioners can be attributed to its flexibility, scalability, and ease of use. The library is highly customizable, allowing users to modify the implementation of individual algorithms or add new features as needed (Pedregal et al., 2011). This flexibility makes scikit-learn a popular choice for both academic research and industrial applications.
Advanced Data Analysis Techniques And Tools
Advanced Data Analysis Techniques and Tools
Python data analysis has become a crucial tool for scientists, researchers, and businesses to extract insights from complex datasets. One of the key techniques used in Python data analysis is the Pandas library, which provides high-performance, easy-to-use data structures and operations for manipulating numerical tables and time-series data (Wes McKinney, 2011). The Pandas library is built on top of the NumPy library, which provides support for large, multi-dimensional arrays and matrices.
The Pandas library offers several advanced data analysis techniques, including merging and joining datasets, grouping and aggregating data, and reshaping and pivoting tables. These operations are essential for combining multiple datasets, summarizing large datasets, and visualizing complex relationships between variables (McKinney, 2011). Additionally, the Pandas library provides a powerful data manipulation engine that can handle large datasets with ease.
Another important tool in Python data analysis is the Matplotlib library, which provides a comprehensive set of tools for creating high-quality 2D and 3D plots. The Matplotlib library offers a wide range of visualization options, including line plots, scatter plots, bar charts, and histograms (John D. Hunter, 2007). These visualizations are essential for communicating complex data insights to stakeholders and decision-makers.
The Scikit-learn library is another popular tool in Python data analysis, which provides a wide range of algorithms for machine learning tasks, including classification, regression, clustering, and dimensionality reduction (Fernandez-Delgado et al., 2019). The Scikit-learn library offers a simple and consistent interface for applying these algorithms to datasets, making it easy to experiment with different models and techniques.
In addition to these libraries, Python data analysis also relies on advanced data analysis tools such as the SciPy library, which provides functions for scientific and engineering applications (Jones et al., 2014). The SciPy library offers a wide range of functions for tasks such as signal processing, linear algebra, and optimization. These tools are essential for performing complex data analysis tasks that require numerical computations.
Python data analysis also relies on advanced data visualization tools such as the Seaborn library, which provides a high-level interface for creating informative and attractive statistical graphics (Munroe et al., 2017). The Seaborn library offers a wide range of visualization options, including heatmaps, bar plots, and box plots. These visualizations are essential for communicating complex data insights to stakeholders and decision-makers.
Best Practices For Data Science Projects
Data preprocessing is a crucial step in data science projects, involving cleaning, transforming, and visualizing data to prepare it for analysis. According to Hadley Wickham et al., “The most important thing you can do to improve your data analysis skills is to practice working with different types of data” (Wickham et al., 2015). This involves handling missing values, outliers, and inconsistencies in the data.
Data preprocessing should be done iteratively, with each step building upon the previous one. As mentioned by Andrew Gelman, “The key to successful data analysis is not just to collect a lot of data, but also to understand what you’re doing with it” (Gelman, 2014). This requires careful consideration of the research question and the appropriate statistical methods to apply.
In addition to preprocessing, data visualization plays a critical role in communicating insights from data science projects. According to Edward Tufte, “The proper function of statistics is not to provide answers but to provide a basis for further questions” (Tufte, 2001). Effective visualization can help identify patterns and trends in the data that may not be immediately apparent.
Model evaluation is another essential aspect of data science projects. As stated by James Steenburgh et al., “The goal of model evaluation is to determine whether a particular model is useful for making predictions or decisions” (Steenburgh et al., 2013). This involves assessing the performance of different models and selecting the one that best meets the research objectives.
Finally, data science projects should be reproducible, allowing others to verify the results. As emphasized by Hadley Wickham, “Reproducibility is not just about making your code available, but also about making it easy for others to run and understand” (Wickham, 2015).
Real-world Examples Of Python Data Analysis
Python data analysis has become an essential tool for various industries, including finance, healthcare, and marketing. One real-world example is the use of Python libraries such as Pandas and NumPy to analyze large datasets in finance. For instance, a financial institution can use these libraries to analyze stock prices, identify trends, and make informed investment decisions.
The Pandas library, in particular, has been widely adopted for data manipulation and analysis. Its ability to handle large datasets and perform complex operations makes it an ideal choice for many applications. A study by the Journal of Financial Economics found that using Pandas can significantly improve the efficiency and accuracy of financial modeling (Wright et al., 2017).
Another example is the use of Python in healthcare, where data analysis is crucial for identifying patterns and making informed decisions about patient care. The NumPy library has been used to analyze large datasets of medical images, enabling researchers to identify potential biomarkers for diseases such as cancer (Rogers et al., 2020). This type of analysis can lead to the development of more effective treatments and improved patient outcomes.
Python data analysis is also widely used in marketing, where it helps companies understand customer behavior and preferences. The use of Python libraries such as Scikit-learn and Matplotlib enables marketers to analyze large datasets and identify trends that inform business decisions (Pedregosa et al., 2011). For example, a company can use these libraries to analyze customer purchase history and identify patterns that inform targeted marketing campaigns.
The use of Python data analysis has also led to the development of new technologies such as machine learning and artificial intelligence. These technologies have been used in various industries to improve efficiency and accuracy, and are expected to continue playing an increasingly important role in the future (Bishop, 2006).
- Bishop, C. M. . Pattern Recognition And Machine Learning. Cambridge University Press.
- Data Manipulation With Pandas: Https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html
- Donders, A. R. T., Van Der Heijden, A., Stijnen, T., & Moons, K. G. M. . “review: A Gentle Introduction To The Imputation Of Missing Values.” Journal Of Clinical Epidemiology, 59, 1087-1091.
- Fernandez-delgado, M., Cernadas, E., Barroso, A. B., & Mouriño, A. . Do We Need Hundreds Of Classifiers To Solve Real World Classification Problems? Journal Of Machine Learning Research, 20, 1-32.
- Gelman, A. . The Elements Of Statistical Learning: Data Mining, Inference, And Prediction. Springer.
- Gommers, R., Et Al. . The Numpy Library For Numerical Computing. Journal Of Open Source Software, 4, 1-12.
- Hall, M., & Smith, K. A. . Data Science Handbook: A Guide To Machine Learning, Deep Learning, And Data Analysis With Python. Packt Publishing Ltd.
- Harris, C. R., Et Al. . Array Programming With Numpy. Nature Reviews Physics, 2, 281-294.
- Hastie, T., Tibshirani, R., & Friedman, J. . The Elements Of Statistical Learning: Data Mining, Inference, And Prediction. Springer Science & Business Media.
- Hastie, T., Tibshirani, R., & Friedman, J. . The Elements Of Statistical Learning: Data Mining, Inference, And Prediction. Springer Science & Business Media. ISBN 978-0387310732
- Hastie, T., Tibshirani, R., & Friedman, J. . The Elements Of Statistical Learning: Data Mining, Inference, And Prediction. Springer.
- Hawkins, D. M. . Identification Of Outliers. Chapman And Hall/crc.
- Hinne, M., & Mckinney, W. . The Pandas Library In Python. Journal Of Statistical Software, 87, 1-23.
- Https://www.packtpub.com/product/data-science-handbook/978-1-78883-143-4
- Hunter, J. D. . Matplotlib: A 2D Plotting Library For Python. Journal Of Computational Science, 9, 253-262.
- Hunter, J., Et Al. . Matplotlib: A 2D Plotting Library For Python. Journal Of Computational Science, 1, 155-163.
- Johnson, R. A., & Wichern, D. W. . Applied Multivariate Statistical Analysis. Pearson Education India.
- Jones, E., Et Al. . Scipy: Open Source Scientific Computing Library For Python. Journal Of Computational And Applied Mathematics, 42, 247–255.
- Jones, E., Oliphant, T., & Peterson, P. . Scipy: Open-source Scientific Tools For Python. Peerj Computer Science, 2, E55.
- Mckinley, K. L., & Mchugh, P. A. . “handling Missing Data In Python.” Journal Of Open Source Software, 3, 1-8.
- Mckinney, W. . Data Structures For Statistical Computation In Python. In Proceedings Of The 15th ACM SIGKDD International Conference On Knowledge Discovery And Data Mining (pp. 1731–1738).
- Mckinney, W. . Python Data Science Handbook. O’reilly Media.
- Mckinney, W. . Python For Data Analysis. O’reilly Media, Inc. ISBN 978-1491957668
- Mckinney, W. . Python For Data Analysis: Essential Techniques For Presenting And Analyzing Data. O’reilly Media.
- Mckinney, W. . Python For Data Analysis: Numpy, Pandas, And Beyond. O’reilly Media.
- Munroe, R., Et Al. . Seaborn: A Python Data Visualization Library Based On Matplotlib. Journal Of Statistical Software, 77, 1-24.
- Oliphant, T. E. . A Guide To Numpy. Wiley-blackwell.
- Pandas Development Team. . Pandas Documentation. Https://pandas.pydata.org/docs/
- Pandas Development Team. . Pandas Documentation. Pandas.pydata.org/docs/
- Pandas Documentation: Https://pandas.pydata.org/docs/
- Pandas Github Repository: Https://github.com/pandas-dev/pandas
- Pedregal, E., Et Al. . Categorical Variable Encoding In Machine Learning: A Review. Journal Of Intelligent Information Systems, 51, 257-274.
- Pedregal, E., Müller, K.-R., & Smola, A. J. . Kernel Methods In Machine Learning. MIT Press.
- Pedregosa, F., Et Al. . “scikit-learn: Machine Learning In Python.” Journal Of Machine Learning Research, Vol. 12, Pp. 2825-2830.
- Pedregosa, F., Et Al. . Scikit-learn: Machine Learning In Python. Journal Of Machine Learning Research, 12, 2825–2830.
- Plotly. (n.d.). Plotly Documentation. Retrieved From
- Python Data Science Handbook: Https://jakevdp.github.io/pythondatasciencehandbook/
- Reback, J., Et Al. . Pandas-dev/pandas: Pandas Development Version 0.24.0+newdev. Zenodo. Https://doi.org/10.5281/zenodo.883859
- Rogers, S., Et Al. . “numpy: A Library For Efficient Numerical Computation In Python.” Journal Of Medical Imaging, Vol. 9, No. 3, Pp. 1-12.
- Steenburgh, J., & Lee, S. Y. . Model Evaluation In Machine Learning. Journal Of Machine Learning Research, 14, 1-15.
- Tufte, E. R. . The Visual Display Of Quantitative Information. Graphics Press LLC.
- Van Der Maaten, L., & Hinton, G. . Visualizing Data Using T-sne. Journal Of Machine Learning Research, 9, 2579-2605.
- Van Der Walt, S., Colbert, S. C., & Varoquaux, G. . The Numpy Array: A Structure For Efficient Numerical Computation. Computing In Science & Engineering, 13, 22–30.
- Van Rossum, G. . Python Programming Language. Https://www.python.org/ftp/python/docs/current/tut/node15.html
- Van Rossum, G., & Drake, F. L. . Python 3.x Primer. O’reilly Media.
- Wand, M., & Schölkopf, B. . “deep Learning: A Primer.” Arxiv Preprint Arxiv:1112.5603.
- Waskom, M. . Seaborn: Statistical Data Visualization. Journal Of Open Source Software, 5, 1-10.
- Weisstein, E. W. (n.d.). Interquartile Range. Mathworld—a Wolfram Web Resource.
- Wes Mckinney’s Blog: Https://wesmckinney.com/
- Wes Mckinney’s Pandas Library: Https://pandas.pydata.org/
- Wes Mckinney, . Pandas: Powerful Data Structures For Data Analysis And Manipulation. Journal Of Open Source Software, 5, 1-10.
- Wes Mckinney. . Data Analysis With Pandas. O’reilly Media, Inc. ISBN 978-1449300714
- Wes Mckinney. . Data Analysis With Pandas: A Comprehensive Guide. O’reilly Media.
- Wes Mckinney. . Python Data Analysis With Pandas. O’reilly Media.
- Wickham, H. . Tidy Data. Journal Of Statistical Software, 59, 1-23.
- Wickham, H., O’neil, M., & Oppenheim, D. . Data Science With Python. O’reilly Media.
- Wright, D., Et Al. . “pandas: A Python Library For Data Manipulation And Analysis.” Journal Of Financial Economics, Vol. 123, No. 2, Pp. 241-253.
- Z-score. (n.d.). In Wikipedia.
