Python vs Microsoft Excel: Choosing the Right Option for Data Analysis
Systems biology adopts and integrates multi-omics approaches to understand human biology. The rapid development of omics tools, including genomics, transcriptomics, proteomics and metabolomics, along with bioinformatics and computer science has helped tremendously in our understanding of biological systems at the molecular level, providing deeper mechanistic insights into our understanding of human health and disease.
High-throughput technologies have made data collection much easier than before. However, the analysis and integration of big data are becoming much more challenging. A more fundamental question arises: is Microsoft Excel sufficient for data analysis or is learning Python necessary to get more insights from the data? To answer this, we need to decipher the situations where using Python is superior to Excel.
Here, I have examined 5 different possibilities to justify why one should consider learning Python for analysing large datasets:
1. Python enables users to assign processed datasets in codes or variables, mitigating the need for storing multiple files or spreadsheets.
Most data will need to be preprocessed before data analysis can be performed. Examples include imputing values for missing data, removing rows or columns with too much missing data and data scaling to allow machine learning models to work better.
These tasks may be easy to perform on Microsoft Excel if the dataset is small. However, if analysing a dataset with dimensions of 20 variables with 20,000 genes (with a total of 400,000 numerical values), users will typically have to save them individually processed files in different excel sheets and files to reduce lagging issues.
In Python, processing these big datasets is much simpler with just a few lines of code using the pandas library in Python.
2. Python is more consistent and less error-prone as compared to Excel.
Using Excel files to analyse small datasets is convenient and is hence, usually preferred. However, with large datasets, the drag and drop function is not perfect. For instance, a lot of time and effort is needed to drag the cells if you have a large number of variables. The drop function is also not useful if you have empty cells in your dataset.
Finally, combining “filter” or “sort” functions with formulas is cumbersome in Excel, especially for large datasets. In Python, however, filtering, grouping, sorting and manipulation of tables can be easily executed in pandas.
3. Python is more time and space-efficient in processing large datasets.
The maximum limit of Excel is 1,048,576 rows by 16,384 columns, which means that you will have to store all your genes from microarray or RNAseq in rows. Even if you stored all your genes in rows, performing excel functions in large datasets can take a long time (can take several minutes).
In some scenarios, due to the long processing time, the excel program may shut down, causing you to lose your updated file. This means that you will have to save your files regularly and each time you press “save”, it takes a few minutes, which means a lot of time wasted on waiting.
In Python, the codes are automatically saved and the DataFrame dimensions are limitless, which means you can manage and transpose the files whenever you like, and the processing is usually completed within seconds.
Another point to note is that all the codes can be saved and shared, which means that the same codes can be used to compare with another big dataset. This saves a lot of time and reduces the need to manage a large number of files.
4. Searching for variables across different datasets are quicker in Python than in Excel.
Within one dataset, Excel can do multiple gene queries with the INDEX function. However, for multiple large datasets, this is not efficient. First, you will have to open multiple spreadsheets, and then use the INDEX function to query each and every spreadsheet. Overall, this can take a long time if you have many spreadsheets.
In Python, after assigning your index column(s), you can use the iloc functions to perform the specific query for every dataset. You can even use the multiquery function in Streamlit to perform query functions. Hence, data query of a large number of variables in multiple datasets is usually more efficient with Python than Excel.
5. Python can draw more interactive and sophisticated graphs than Excel.
With more people sharing their codes and developing packages, Python can plot more sophisticated graphs which is important for data visualisation of large datasets. For instance, Python can do hierarchical clustering and heatmaps easier than excels. Complex functions involving unsupervised clustering, such as principal component analyses are also easier to perform in python. Finally, the Plotly and Altair libraries in Python makes graphs interactive, allowing users to hover over data points to acquire individual data point characteristics.
Other reasons give Python a competitive edge, including the ease of implementing machine learning models, building data dashboards for clients and sharing of data analysis workflows. I believe in learning the fundamentals first, before delving into the sophisticated models and framework. A good place to start is to read the pandas documentation. Another website that provides good introduction can be found here as well.
Thus, we can conclude that Python is better than Excel for data analysis.
I hope this helps you in making your choice. Thank you for reading!