The Ultimate Collection: 125 Python Packages for Data Science, Machine Learning, and Beyond
All about Python packages
Introduction:
Python, one of the world’s most popular programming languages, boasts a vast ecosystem of modules and packages, with over 350,000 available to developers. This rich collection empowers Python developers to tackle a diverse range of tasks, from data analysis and machine learning to web development and automation. This article surveys the most important modules in areas such as data science, machine learning, web development, and more.
(Figure: the most downloaded PyPI packages.)
Contents:
Let's walk through the most important packages used in Python.
1. Calendar:
import calendar
cal = calendar.TextCalendar()
cal.prmonth(2023, 3)
     March 2023
Mo Tu We Th Fr Sa Su
       1  2  3  4  5
 6  7  8  9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31
- General calendar-related functions.
- Display calendars and handle dates.
- Support for leap years, weekdays, and month ranges
2. Collections:
from collections import Counter
words = ["apple", "banana", "apple", "orange", "banana", "apple"]
word_count = Counter(words)
print(word_count)
Counter({'apple': 3, 'banana': 2, 'orange': 1})
- Specialized container datatypes.
- Counter, defaultdict, OrderedDict, namedtuple, and deque.
- More efficient and flexible alternatives to built-in types.
3. bisect:
import bisect
sorted_list = [1, 3, 4, 4, 6, 8]
position = bisect.bisect_left(sorted_list, 4)
print(position)
#2
- Array bisection algorithms for sorted sequences.
- Binary search, insertion, and more.
- Efficiently find and maintain sorted order in lists.
4. heapq:
import heapq
nums = [4, 7, 2, 5, 1, 3]
heapq.heapify(nums)
smallest = heapq.heappop(nums)
print(smallest)
#1
- Heap queue algorithms (priority queues).
- Maintain a sorted collection of items with efficient insertions and removals.
- Useful for scheduling, priority-based tasks, and more.
5. json:
import json
data = {"name": "John", "age": 30}
json_string = json.dumps(data)
decoded_data = json.loads(json_string)
print(json_string)
print(decoded_data)
#{"name": "John", "age": 30}
#{'name': 'John', 'age': 30}
- Encode and decode JSON data.
- Serialize and deserialize Python objects to JSON format.
- Store and exchange data in a lightweight, human-readable format.
6. configparser:
import configparser
config = configparser.ConfigParser()
config.read("example.ini")
name = config.get("section", "name")
age = config.getint("section", "age")
print(name, age)
- Configuration file parser
- Read and write data from INI files
- Manage application settings and user preferences
7. sched:
import sched
import time
def print_event(event_name):
    print(f"Event: {event_name}")
s = sched.scheduler(time.time, time.sleep)
s.enter(5, 1, print_event, ("Event 1",))
s.enter(10, 1, print_event, ("Event 2",))
s.run()
#Event: Event 1
#Event: Event 2
- General-purpose event scheduler
- Schedule and execute tasks at specific times or intervals
- Perform timed operations, such as periodic updates or reminders
8. random:
import random
random_float = random.random()
random_int = random.randint(1, 10)
random_choice = random.choice(["apple", "banana", "orange"])
print(random_float)
print(random_int)
print(random_choice)
#0.8998656473050128
#6
#orange
- Generate random numbers and make random choices
- Support for uniform, Gaussian, and other distributions
- Shuffle and sample from sequences
9. secrets:
import secrets
token = secrets.token_hex(16)
print(token)
#97f3108ff85aef4b0b00c3c2154ae873
- Generate cryptographically strong random numbers and tokens
- Suitable for passwords, account authentication, and security tokens
- A safer choice than the random module for security-sensitive code
10. difflib:
import difflib
text1 = "The quick brown fox jumps over the lazy dog"
text2 = "The quick red fox jumps over the lazy dog"
differ = difflib.Differ()
diff = list(differ.compare(text1.split(), text2.split()))
print("\n".join(diff))
The
quick
- brown
+ red
fox
jumps
over
the
lazy
dog
- Helpers for computing differences between sequences
- Create and display diffs for text, lists, and other data
- Identify changes, corrections, and updates in the data
11. timeit:
import timeit
def slow_function():
    return sum(range(100000))
execution_time = timeit.timeit(slow_function, number=100)
print(f"Execution time: {execution_time} seconds")
#Execution time: 0.2170043410001199 seconds
- Measure the execution time of small code snippets
- Test performance and optimize code
- Compare the speed of different solutions and implementations
12. pdb:
import pdb
def buggy_function(x):
    y = x * 2
    pdb.set_trace()
    return y + x
result = buggy_function(10)
print(result)
- Python debugger
- Set breakpoints, step through code, and inspect variables
- Debug and troubleshoot code interactively
13. xml.etree.ElementTree:
import xml.etree.ElementTree as ET
xml_string = """
<person>
<name>Alice</name>
<age>30</age>
<city>New York</city>
</person>
"""
root = ET.fromstring(xml_string)
name = root.find("name").text
age = int(root.find("age").text)
city = root.find("city").text
print(name, age, city)
#Alice 30 New York
- XML processing library
- Parse, manipulate, and generate XML documents
- Read and write XML data in a structured and hierarchical format
14. HTMLParser:
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(f"Start tag: {tag}")
    def handle_endtag(self, tag):
        print(f"End tag: {tag}")
    def handle_data(self, data):
        print(f"Data: {data.strip()}")
parser = MyHTMLParser()
parser.feed("<html><head><title>Example</title></head><body><p>Hello, world!</p></body></html>")
Start tag: html
Start tag: head
Start tag: title
Data: Example
End tag: title
End tag: head
Start tag: body
Start tag: p
Data: Hello, world!
End tag: p
End tag: body
End tag: html
- Basic HTML and XHTML parser
- Extract information and data from HTML documents
- Build custom web scrapers and data extraction tools
15. re:
import re
text = "The quick brown fox jumps over the lazy dog"
pattern = r"\b\w{5}\b"
five_letter_words = re.findall(pattern, text)
print(five_letter_words)
# ['quick', 'brown', 'jumps']
- Regular expression operations
- Search, match, and manipulate text based on patterns
- Validate and clean data, extract information, and perform advanced text manipulation
16. argparse:
import argparse
parser = argparse.ArgumentParser(description="Example command line program.")
parser.add_argument("--input", type=str, required=True, help="Input file")
parser.add_argument("--output", type=str, help="Output file")
args = parser.parse_args()
print(f"Input file: {args.input}")
print(f"Output file: {args.output}")
- Command-line option and argument parsing
- Create user-friendly command-line interfaces for your scripts and programs
- Handle options, arguments, and flags with ease and flexibility
17. logging:
import logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logging.debug("This is a debug message")
logging.info("This is an info message")
logging.warning("This is a warning message")
logging.error("This is an error message")
logging.critical("This is a critical message")
2023-03-19 20:02:27,337 [INFO] This is an info message
2023-03-19 20:02:27,337 [WARNING] This is a warning message
2023-03-19 20:02:27,337 [ERROR] This is an error message
2023-03-19 20:02:27,337 [CRITICAL] This is a critical message
- Flexible event logging system
- Log messages with different severity levels to various outputs
- Debug, monitor, and analyze your applications and systems
18. decimal:
from decimal import Decimal
a = Decimal("0.1")
b = Decimal("0.2")
c = a + b
print(c)
#0.3
- Decimal fixed-point and floating-point arithmetic
- Perform precise and accurate calculations with decimal numbers
- Suitable for financial applications, scientific simulations, and other numerically sensitive tasks
19. fractions:
from fractions import Fraction
a = Fraction(1, 3)
b = Fraction(1, 6)
c = a + b
print(c)
#1/2
- Rational number arithmetic
- Perform calculations with fractions and exact rational numbers
- Useful for exact arithmetic and applications with a focus on accuracy and precision
20. sqlite3:
import sqlite3
conn = sqlite3.connect("example.db")
c = conn.cursor()
c.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
c.execute("INSERT INTO users (name, age) VALUES (?, ?)", ("Alice", 30))
conn.commit()
for row in c.execute("SELECT * FROM users"):
    print(row)
conn.close()
# (1, 'Alice', 30)
- Manage and interact with SQLite databases directly from your Python code
- Store, query, and manipulate data in a lightweight, serverless, and self-contained format
21. requests:
import requests
response = requests.get("https://api.example.com/data")
if response.status_code == 200:
    data = response.json()
    print(data)
else:
    print(f"Error: {response.status_code}")
- HTTP library for making requests
- Interact with RESTful APIs, download files, and scrape web content
- Simplify HTTP requests with a user-friendly and feature-rich API
22. flask:
from flask import Flask, jsonify
app = Flask(__name__)
@app.route("/")
def hello():
    return "Hello, World!"
@app.route("/api/data")
def api_data():
    data = {"name": "Alice", "age": 30}
    return jsonify(data)
if __name__ == "__main__":
    app.run()
* Serving Flask app '__main__'
* Debug mode: off
INFO:werkzeug:WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit
- Lightweight web application framework
- Build web applications, APIs, and microservices quickly and easily
- Offers a flexible and extensible architecture for your web projects
23. pytest:
# Install pytest using pip:
# pip install pytest
# Create a test file named "test_example.py" with the following content:
def add(a, b):
    return a + b
def test_add():
    assert add(1, 2) == 3
    assert add(-1, 1) == 0
    assert add(0, 0) == 0
# To run the tests, use the `pytest` command-line tool:
# pytest test_example.py
- Testing framework for Python applications
- Write and organize tests for your code, libraries, and projects
- Discover, execute, and report on tests with ease and flexibility
24. scipy:
import numpy as np
from scipy.optimize import minimize
def objective_function(x):
    return x[0] ** 2 + x[1] ** 2
initial_guess = np.array([1, 1])
result = minimize(objective_function, initial_guess)
print(result.x)
[-1.07505143e-08 -1.07505143e-08]
- Scientific computing library
- Provides a wide range of algorithms and tools for optimization, integration, interpolation, and more
- Builds on the NumPy library to offer advanced functionality for scientific applications and research
25. os:
import os
print(os.listdir("."))
#['.config', 'example.db', 'sample_data']
- Interact with the operating system
- File and directory management
- Environment variables and process information
26. glob:
import glob
print(glob.glob("*.txt"))
- Find all pathnames matching a specified pattern
- Wildcard pattern matching
- The simple way to list files in a directory
27. itertools:
import itertools
for combo in itertools.combinations("ABC", 2):
    print(combo)
('A', 'B')
('A', 'C')
('B', 'C')
- Iterator building blocks
- Combinatoric generators, such as permutations and combinations
- Memory-efficient looping
28. time:
import time
start_time = time.time()
time.sleep(2)
end_time = time.time()
print(f"Elapsed time: {end_time - start_time} seconds")
Elapsed time: 2.002392292022705 seconds
- Time access and conversions
- Measure performance and delays
- Handle time zones and daylight saving time
29. datetime:
from datetime import datetime, timedelta
now = datetime.now()
one_week_from_now = now + timedelta(weeks=1)
print(now)
print(one_week_from_now)
2023-03-19 20:02:27.337600
2023-03-26 20:02:27.337600
- Manipulate dates and times
- Perform arithmetic with dates
- Format and parse dates and times
30. hashlib:
import hashlib
message = "Hello, world!"
hashed = hashlib.sha256(message.encode("utf-8")).hexdigest()
print(hashed)
# 315f5bdb76d078c43b8ac0064e4a0164612b1fce77c869345bfc94c75894edd3
- Cryptographic hashing and message digest algorithms
- Support for SHA, MD5, and other hash functions
- Create secure hashes and check message integrity
31. urllib:
from urllib.request import urlopen
from urllib.parse import urlparse
url = "https://www.medium.com"
response = urlopen(url)
parsed_url = urlparse(url)
print(response.read())
print(parsed_url)
- URL handling modules
- Open URLs, encode and decode data, and parse URLs
- Work with HTTP, HTTPS, and FTP protocols
32. flake8:
# example.py
def add(a, b):
    return a + b
print(add(1, 2))
- Run flake8 example.py to check for style and quality issues
- Enforce PEP 8 code style
- Catch errors and improve code readability
33. pathlib:
from pathlib import Path
current_dir = Path(".")
for file in current_dir.iterdir():
    print(file)
#.config
#example.db
#sample_data
- Object-oriented filesystem paths
- Simplifies file and directory operations
- Works on Windows, macOS, and Linux
34. smtplib:
import smtplib
from_email = "you@example.com"
to_email = "recipient@example.com"
message = f"Subject: Hello\n\nHello, {to_email}!"
with smtplib.SMTP("smtp.example.com", 587) as server:
    server.starttls()
    server.login(from_email, "your-password")
    server.sendmail(from_email, to_email, message)
- Send emails using the Simple Mail Transfer Protocol (SMTP)
- Authenticate with email servers
- Send plain-text and HTML emails
35. email:
from email.message import EmailMessage
import smtplib
msg = EmailMessage()
msg.set_content("Hello, world!")
msg["Subject"] = "Greetings"
msg["From"] = "you@example.com"
msg["To"] = "recipient@example.com"
with smtplib.SMTP("smtp.example.com", 587) as server:
    server.starttls()
    server.login("you@example.com", "your-password")
    server.send_message(msg)
- Create, manipulate, and parse email messages
- Complements the smtplib and imaplib modules
35. yaml (PyYAML):
import yaml
data = {"name": "John", "age": 30}
yaml_data = yaml.dump(data)
print(yaml_data)
age: 30
name: John
- Read and write YAML (YAML Ain’t Markup Language) files
- Human-readable data serialization format
- Supports complex data structures and custom data types
36. platform:
import platform
print(platform.system())
print(platform.python_version())
Linux
3.9.16
- Access system and platform information
- Retrieve the OS version, hardware details, and Python version
- Write cross-platform code
37. math:
import math
print(math.sqrt(9))
print(math.sin(math.pi / 6))
3.0
0.49999999999999994
- Basic mathematical functions and constants
- Trigonometry, logarithms, exponentiation, and more
- Floating-point arithmetic and rounding functions
38. statistics:
import statistics
data = [1, 2, 3, 4, 5, 6]
print(statistics.mean(data))
print(statistics.median(data))
print(statistics.stdev(data))
3.5
3.5
1.8708286933869707
- Basic statistical functions
- Calculate the mean, median, mode, variance, and standard deviation
- Works with ints, floats, Decimals, and Fractions
39. queue:
import queue
q = queue.Queue()
for i in range(5):
    q.put(i)
while not q.empty():
    print(q.get())
0
1
2
3
4
- FIFO (first-in, first-out), LIFO (last-in, first-out), and priority queues
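The module also provides LIFO and priority variants; a minimal sketch with illustrative values:
lifo = queue.LifoQueue()
for i in range(3):
    lifo.put(i)
print(lifo.get())  # 2, last in, first out
pq = queue.PriorityQueue()
pq.put((2, "low"))
pq.put((1, "high"))
print(pq.get())  # (1, 'high'), the lowest priority value comes out first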
40. tempfile:
import tempfile
with tempfile.NamedTemporaryFile(mode="w+t") as temp_file:
    temp_file.write("Hello, world!")
    temp_file.seek(0)
    print(temp_file.read())
Hello, world!
- Create temporary files and directories
- Automatically clean up resources after use
- Named and unnamed, file-like objects and file system entries
41. uuid:
import uuid
random_uuid = uuid.uuid4()
print(random_uuid)
4cef1f15-10ee-4dbb-84ec-c6527925258b
- Universally unique identifiers (UUIDs)
- Generate and manipulate 128-bit identifiers
- Useful for generating unique keys, identifiers, or tokens
42. zipfile:
import zipfile
with zipfile.ZipFile("archive.zip", "w") as zf:
    zf.write("file.txt")
- Use the ‘zipfile’ module to create and extract ZIP archives.
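Reading the archive back works the same way; a minimal sketch, assuming archive.zip exists:
with zipfile.ZipFile("archive.zip") as zf:
    print(zf.namelist())        # list the archive members
    zf.extractall("extracted")  # extract everything into a directory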
43. csv:
import csv
with open("data.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Name", "Age", "City"])
    writer.writerow(["Alice", 30, "New York"])
Use the ‘csv’ module to read and write CSV files.
44. copy:
import copy
original_list = [[1, 2], [3, 4]]
shallow_copy = copy.copy(original_list)
deep_copy = copy.deepcopy(original_list)
Use the ‘copy’ module to create shallow and deep copies of lists or other mutable objects.
45. atexit:
import atexit
def goodbye():
    print("Goodbye!")
atexit.register(goodbye)
Use the ‘atexit’ module to register functions to be called when the program exits.
46. pickle:
import pickle
data = {"a": 1, "b": 2}
serialized = pickle.dumps(data)
deserialized = pickle.loads(serialized)
- Use the ‘pickle’ module to serialize and deserialize Python objects, which can be useful for storing data or communication between processes.
47. pprint:
import pprint
data = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]}
pprint.pprint(data, indent=4)
{'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}
Use the ‘pprint’ module to pretty-print complex data structures for better readability.
48. fileinput:
import fileinput
with fileinput.input(files=('file1.txt', 'file2.txt')) as f:
    for line in f:
        print(f.filename(), f.lineno(), line, end='')
Use the ‘fileinput’ module to read multiple files line by line, treating them as a single input stream.
49. doctest:
def add(a, b):
    """
    >>> add(1, 2)
    3
    """
    return a + b
if __name__ == "__main__":
    import doctest
    doctest.testmod()
Use the ‘doctest’ module to test your code by writing examples in your function’s docstring, allowing you to keep documentation and tests close to the code.
50. inspect:
import inspect
def my_function():
    pass
print(inspect.getsource(my_function))
def my_function():
    pass
Use the ‘inspect’ module to retrieve information about live objects, such as their source code, documentation, or call stack.
51. locale:
import locale
locale.setlocale(locale.LC_ALL, '')
formatted_number = locale.format_string("%d", 1234567, grouping=True)
print(formatted_number)  # e.g. 1,234,567 in an en_US locale
Use the ‘locale’ module to work with locale-specific formatting of numbers, dates, and currency.
52. traceback:
import traceback
try:
    1 / 0
except ZeroDivisionError:
    traceback.print_exc()
Traceback (most recent call last):
File "<ipython-input-43-507d9690673b>", line 4, in <module>
1 / 0
ZeroDivisionError: division by zero
Use the ‘traceback’ module to print and format exception tracebacks, which can be useful for debugging and logging.
53. zlib:
import zlib
data = b"example data" * 100
compressed = zlib.compress(data)
decompressed = zlib.decompress(compressed)
Use the ‘zlib’ module to compress and decompress data using the DEFLATE algorithm, which can be useful for reducing storage space or network bandwidth usage.
54. fnmatch:
import fnmatch
filenames = ["file1.txt", "file2.pdf", "file3.txt"]
txt_files = [name for name in filenames if fnmatch.fnmatch(name, "*.txt")]
Use the ‘fnmatch’ module to filter filenames or other strings using shell-style wildcards, which can be helpful when working with file lists.
55. sys:
import sys
python_version = sys.version
script_name = sys.argv[0]
Use the ‘sys’ module to access system-specific parameters and functions, such as command-line arguments, Python version, or the current interpreter’s path.
56. shutil:
import shutil
shutil.copy("source.txt", "destination.txt")
shutil.move("old_location.txt", "new_location.txt")
Use the ‘shutil’ module to perform high-level file operations, such as copying or moving files and directories.
57. typing:
from typing import List, Tuple
def my_function(numbers: List[int]) -> Tuple[int, int]:
    return min(numbers), max(numbers)
Use the ‘typing’ module to annotate your functions and classes with type hints, which can improve code readability and facilitate static type checking with tools like Mypy.
58. pkgutil:
import pkgutil
for importer, module_name, _ in pkgutil.iter_modules():
    print(module_name)
Use the ‘pkgutil’ module to work with Python packages, such as listing all installed modules or iterating through a package’s modules.
59. array:
import array
arr = array.array("i", [1, 2, 3, 4, 5])
Use the ‘array’ module to create and manipulate arrays of fixed-size numeric types, which can be more memory-efficient and faster than using lists for large amounts of numerical data.
60. shelve:
import shelve
with shelve.open("my_shelf") as db:
    db["data"] = {"key": "value"}
with shelve.open("my_shelf") as db:
    print(db["data"])
Use the ‘shelve’ module to create and work with persistent dictionaries, which store key-value pairs on disk and can be used as a simple database for small-scale applications.
61. numpy:
import numpy as np
array = np.array([1, 2, 3, 4, 5])
mean = np.mean(array)
NumPy is a powerful library for numerical computing in Python, providing support for arrays, matrices, and various mathematical operations.
62. pandas:
import pandas as pd
data = {"A": [1, 2, 3], "B": [4, 5, 6]}
df = pd.DataFrame(data)
pandas is a popular library for data manipulation and analysis, providing data structures like DataFrame and Series for handling tabular and time-series data.
63. matplotlib:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
plt.plot(x, y)
plt.show()
matplotlib is a plotting library for creating static, interactive, and animated visualizations in Python, such as line, scatter, and bar plots.
64. seaborn:
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset("tips")
sns.boxplot(x="day", y="total_bill", data=tips)
plt.show()
seaborn is a statistical data visualization library built on top of matplotlib, providing a high-level interface for creating informative and attractive statistical graphics.
65. scikit-learn:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression().fit(X_train, y_train)
scikit-learn is a popular library for machine learning in Python, providing tools for classification, regression, clustering, and various other learning tasks.
66. statsmodels:
import statsmodels.api as sm
import numpy as np
X = np.random.rand(100)
y = 2 * X + 0.5 * np.random.randn(100)
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
statsmodels is a library for estimating statistical models and performing statistical tests, offering a wide range of statistical models such as linear regression, logistic regression, and time series analysis.
67. plotly:
import plotly.express as px
data = px.data.iris()
fig = px.scatter(data, x="sepal_width", y="sepal_length", color="species")
fig.show()
Plotly is a library for creating interactive, web-based visualizations using Python, such as line charts, bar charts and scatter plots, with support for advanced features like animations and 3D plots.
68. bokeh:
from bokeh.plotting import figure, output_file, show
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
output_file("line.html")
p = figure(title="Line plot example", x_axis_label="x", y_axis_label="y")
p.line(x, y, legend_label="y=2x", line_width=2)
show(p)
Bokeh is a library for creating interactive, web-based visualizations in Python, providing a flexible and high-level interface for creating complex, feature-rich plots.
69. folium:
import folium
m = folium.Map(location=[45.523, -122.675], zoom_start=13)
folium.Marker([45.524, -122.674], popup="Portland, Oregon").add_to(m)
m.save("map.html")
folium is a library for creating interactive maps using Python and the popular JavaScript library Leaflet, allowing you to easily visualize spatial data and add various map layers and markers.
70. geopandas:
import geopandas as gpd
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
world.plot()
geopandas is a library for working with geospatial data in Python, extending the functionality of pandas by adding support for geospatial data types and operations.
71. wordcloud:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
text = "Python is a great language for data analysis and visualization"
wc = WordCloud(background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
wordcloud is a library for creating word clouds, which are a popular way to visualize the frequency of words in a text corpus.
72. pydot:
import pydot
graph = pydot.Dot(graph_type="digraph")
node_a = pydot.Node("A")
node_b = pydot.Node("B")
edge = pydot.Edge(node_a, node_b)
graph.add_edge(edge)
graph.write_png("graph.png")
pydot is a library for creating and manipulating graph descriptions in the DOT language, which is a popular language for representing directed and undirected graphs.
73. tensorflow:
import tensorflow as tf
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5)
TensorFlow is an open-source library for machine learning and artificial intelligence, providing a flexible platform for defining and running computational graphs, with support for deep learning models.
74. pytorch:
import torch
import torch.nn as nn
import torch.optim as optim
X = torch.randn(100, 20)
y = torch.randint(0, 2, (100,))
model = nn.Sequential(nn.Linear(20, 10), nn.ReLU(), nn.Linear(10, 2))
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
for epoch in range(10):
    optimizer.zero_grad()
    y_pred = model(X)
    loss = criterion(y_pred, y)
    loss.backward()
    optimizer.step()
PyTorch is an open-source machine learning library based on the Torch library, providing tensor computation and deep learning capabilities with strong GPU acceleration support.
75. xgboost:
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
param = {"objective": "reg:squarederror", "eval_metric": "rmse"}
bst = xgb.train(param, dtrain, num_boost_round=100, evals=[(dtest, "test")])
y_pred = bst.predict(dtest)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
XGBoost is a scalable and high-performance gradient boosting library for tree-based models, providing a flexible and efficient solution for supervised learning tasks.
76. lightgbm:
import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
dtrain = lgb.Dataset(X_train, label=y_train)
dtest = lgb.Dataset(X_test, label=y_test)
param = {"objective": "binary", "metric": "binary_logloss"}
bst = lgb.train(param, dtrain, num_boost_round=100, valid_sets=[dtest])
y_pred = np.round(bst.predict(X_test))
accuracy = accuracy_score(y_test, y_pred)
LightGBM is a gradient boosting framework that uses tree-based learning algorithms, designed to be efficient and scalable for large datasets and high-performance tasks.
77. catboost:
from catboost import CatBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=2, verbose=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
CatBoost is a high-performance gradient boosting library, designed specifically for categorical feature handling and improving performance on datasets with categorical features.
78. Beautiful Soup:
from bs4 import BeautifulSoup
import requests
url = "https://www.medium.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
headings = soup.find_all("h1")
for heading in headings:
    print(heading.text)
BeautifulSoup is a library for parsing HTML and XML documents, providing an easy-to-use interface for navigating, searching, and modifying the parse tree.
79. Scrapy:
import scrapy
from scrapy.crawler import CrawlerProcess
class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://www.example.com"]
    def parse(self, response):
        headings = response.css("h1::text").getall()
        print(headings)
process = CrawlerProcess()
process.crawl(ExampleSpider)
process.start()
Scrapy is an open-source web crawling framework for extracting structured data from websites, with built-in support for selectors, item pipelines, and concurrent crawling.
80. nltk:
import nltk
nltk.download("punkt")
text = "Natural Language Processing is an interesting field of study."
tokens = nltk.word_tokenize(text)
print(tokens)
nltk (Natural Language Toolkit) is a library for working with human language data (text), providing tools for text processing, classification, tokenization, stemming, and more.
81. spaCy:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. is a technology company based in Cupertino, California."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
spaCy is an open-source library for advanced Natural Language Processing tasks, providing support for part-of-speech tagging, named entity recognition, and various other NLP tasks.
82. gensim:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
sentences = ["Machine learning is a subset of artificial intelligence.",
             "Deep learning is a subfield of machine learning."]
tokenized_sentences = [simple_preprocess(s) for s in sentences]
model = Word2Vec(tokenized_sentences, min_count=1)
print(model.wv["machine"])
gensim is a library for unsupervised topic modeling and natural language processing, providing implementations for popular algorithms like Word2Vec, FastText, and LDA.
83. pymongo:
import pymongo
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["example_db"]
collection = db["example_collection"]
data = {"name": "John Doe", "age": 30, "city": "New York"}
result = collection.insert_one(data)
print(result.inserted_id)
pymongo is a Python driver for MongoDB, allowing you to work with MongoDB databases using Python-like syntax and data structures.
84. openpyxl:
import openpyxl
workbook = openpyxl.Workbook()
sheet = workbook.active
sheet["A1"] = "Hello"
sheet["B1"] = "World"
workbook.save("example.xlsx")
openpyxl is a library for reading and writing Excel files (xlsx), allowing you to work with Excel spreadsheets using Python.
85. xlrd:
import xlrd
workbook = xlrd.open_workbook("example.xls")
sheet = workbook.sheet_by_index(0)
for row in range(sheet.nrows):
    print(sheet.row_values(row))
xlrd is a library for reading data and formatting information from legacy Excel files (xls; versions before 2.0 also handled xlsx), allowing you to extract data from Excel spreadsheets using Python.
86. xlwt:
import xlwt
workbook = xlwt.Workbook()
sheet = workbook.add_sheet("Sheet1")
sheet.write(0, 0, "Hello")
sheet.write(0, 1, "World")
workbook.save("example.xls")
xlwt is a library for writing data and formatting information to Excel files (xls), allowing you to create Excel spreadsheets using Python.
87. PyPDF2:
import PyPDF2
with open("example.pdf", "rb") as file:
    reader = PyPDF2.PdfReader(file)  # PyPDF2 3.x API
    print(f"Number of pages: {len(reader.pages)}")
    page = reader.pages[0]
    print(page.extract_text())
PyPDF2 is a library for working with PDF files, allowing you to extract text, metadata, and other information from PDF documents using Python.
88. pdfminer:
from pdfminer.high_level import extract_text
text = extract_text("example.pdf")
print(text)
pdfminer is a library for extracting text, metadata, and other information from PDF files, providing a more advanced and customizable interface for working with PDF documents using Python.
89. pytesseract:
from PIL import Image
import pytesseract
image = Image.open("example.png")
text = pytesseract.image_to_string(image)
print(text)
pytesseract is an OCR (Optical Character Recognition) library for Python, allowing you to extract text from images using the Tesseract OCR engine.
90. Pillow:
from PIL import Image
image = Image.open("example.jpg")
image.thumbnail((100, 100))
image.save("thumbnail.jpg")
Pillow is a fork of the Python Imaging Library (PIL) that provides extensive file format support, an efficient internal representation, and powerful image processing capabilities for Python.
91. holoviews:
import holoviews as hv
hv.extension("bokeh", logo=False)
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]
curve = hv.Curve((x, y), "x", "y")
curve.opts(width=400, height=300, line_color="red")
HoloViews is a high-level visualization library for creating interactive plots with concise expressions, allowing you to create complex and flexible visualizations without writing large amounts of code.
92. dask:
import dask.array as da
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = x + x.T
z = y[::2, 5000:].mean(axis=1)
result = z.compute()
print(result)
Dask is a parallel computing library for Python that allows you to parallelize operations on large data structures like arrays, dataframes, and lists, providing an alternative to NumPy, pandas, and other libraries for handling large-scale data.
93. pyarrow:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
data = pa.Table.from_pandas(pd.DataFrame({"A": range(5), "B": range(5, 10)}))
pq.write_table(data, "example.parquet")
PyArrow is a cross-language development platform for in-memory data, providing tools for working with Apache Arrow, a standardized columnar memory format for high-performance analytics.
94. sympy:
from sympy import symbols, Eq, solve
x, y = symbols("x y")
eq1 = Eq(3 * x + 4 * y, 12)
eq2 = Eq(x - y, 2)
solutions = solve((eq1, eq2), (x, y))
print(solutions)
SymPy is a Python library for symbolic mathematics, allowing you to perform algebraic manipulations, calculus, linear algebra, and more using symbolic expressions.
95. redis:
import redis
r = redis.Redis(host="localhost", port=6379, db=0)
r.set("name", "John Doe")
print(r.get("name"))
redis-py is a Python client for Redis, a high-performance in-memory data store, providing an easy-to-use interface for working with Redis data structures like strings, hashes, lists, sets, and sorted sets.
96. lxml:
from lxml import etree
xml_data = "<root><element>text</element></root>"
root = etree.fromstring(xml_data)
element = root.find("element")
print(element.text)
lxml is a library for processing XML and HTML in Python, providing a fast and easy-to-use interface for parsing, validating, and manipulating XML and HTML documents.
97. OpenCV:
import cv2
img = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE)
resized_img = cv2.resize(img, (100, 100))
cv2.imwrite('resized_image.jpg', resized_img)
OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine learning software library, providing a wide range of functionalities for image and video processing, including object detection, feature extraction, and image transformation.
98. IMDbPY:
from imdb import IMDb
ia = IMDb()
movie = ia.get_movie('0133093') # The Matrix (1999)
print(movie.summary())
IMDbPY is a Python package for accessing the IMDb’s movie database, providing an easy-to-use interface for retrieving information about movies, people, characters, and companies.
99. Hugging Face Transformers:
from transformers import pipeline
summarizer = pipeline("summarization")
text = "Hugging Face is a company based in New York and Paris that provides state-of-the-art natural language processing models."
summary = summarizer(text, max_length=25, min_length=5, do_sample=False)
print(summary[0]["summary_text"])
Hugging Face Transformers is a library for working with state-of-the-art natural language processing models, such as BERT, GPT, and RoBERTa, providing an easy-to-use interface for tasks like text classification, summarization, and generation.
100. joblib:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from joblib import dump, load
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.3)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
dump(model, "random_forest_digits.joblib")
loaded_model = load("random_forest_digits.joblib")
print("Loaded model score:", loaded_model.score(X_test, y_test))
joblib is a library for saving and loading trained models, as well as parallelizing tasks for faster computation.
101. SQLAlchemy:
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
Base = declarative_base()
engine = create_engine('sqlite:///mydb.sqlite3')
class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()
- SQL toolkit and Object-Relational Mapper (ORM)
- Simplifies database operations
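Continuing the snippet above, a minimal sketch of persisting and querying a row through the ORM:
user = User(name="Alice")
session.add(user)
session.commit()
for user in session.query(User).all():
    print(user.id, user.name)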
102. MLFlow:
import mlflow
mlflow.start_run()
mlflow.log_param("learning_rate", 0.01)  # log an illustrative parameter
mlflow.log_metric("accuracy", 0.95)      # log an illustrative metric
mlflow.end_run()
- The platform for managing the ML lifecycle
- Provides tools for tracking experiments, packaging code, and sharing results
103. TensorFlow Extended (TFX):
- End-to-end platform for deploying production ML pipelines
- Code snippet: import tensorflow_data_validation as tfdv
- Offers a set of libraries for data validation, transformation, and serving
- Integrates with TensorFlow for machine learning model training
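A minimal data-validation sketch with the tensorflow_data_validation library named above, assuming a local data.csv file:
import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_csv(data_location="data.csv")
schema = tfdv.infer_schema(stats)                    # infer a schema from the statistics
anomalies = tfdv.validate_statistics(stats, schema)  # check the data against the schema
tfdv.display_anomalies(anomalies)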
104. Prefect:
- Workflow management system for building, scheduling, and monitoring data pipelines
- Code snippet: from prefect import task, Flow
- Provides a Pythonic way to define tasks and their dependencies
- Supports distributed execution and offers a UI for monitoring
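A minimal sketch using the Prefect 1.x API shown above (Prefect 2.x replaces Flow with a @flow decorator):
from prefect import task, Flow
@task
def extract():
    return [1, 2, 3]
@task
def transform(data):
    return [x * 2 for x in data]
with Flow("etl") as flow:  # wire the tasks into a flow
    result = transform(extract())
flow.run()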
105. Kedro:
- Open-source framework for creating reproducible, maintainable, and modular data science code
- Code snippet: from kedro.pipeline import Pipeline
- Facilitates organizing code into pipelines and nodes
- Integrates with various data processing and ML libraries
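A minimal pipeline sketch; the "raw_data" and "clean_data" names are hypothetical data catalog entries:
from kedro.pipeline import Pipeline, node
def preprocess(raw):
    return [r for r in raw if r is not None]  # drop missing records (illustrative)
pipeline = Pipeline([
    node(preprocess, inputs="raw_data", outputs="clean_data"),
])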
106. Apache Airflow:
- Workflow management platform for scheduling and monitoring data pipelines
- Code snippet: from airflow import DAG
- Allows creating, scheduling, and monitoring workflows using Directed Acyclic Graphs (DAGs)
- Supports a wide variety of operators for integrating various services
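A minimal Airflow 2.x-style sketch of a daily DAG with a single Python task; the DAG and task names are illustrative:
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
def greet():
    print("Hello from Airflow")
with DAG("example_dag", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    greet_task = PythonOperator(task_id="greet", python_callable=greet)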
107. PyTorch Lightning:
- Lightweight PyTorch wrapper for high-performance AI research
- Code snippet: import pytorch_lightning as pl
- Simplifies training, evaluation, and deployment of PyTorch models
- Offers built-in support for distributed training and mixed-precision
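A minimal LightningModule sketch; the model and hyperparameters are illustrative, and train_dataloader in the final comment stands for a user-supplied DataLoader:
import pytorch_lightning as pl
import torch
from torch import nn
class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(20, 2)
    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.layer(x), y)
    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
# train with: pl.Trainer(max_epochs=5).fit(LitClassifier(), train_dataloader)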
108. Optuna:
- Automatic hyperparameter optimization framework
- Code snippet: import optuna
- Supports various optimization algorithms
- Offers easy integration with popular ML libraries
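A minimal sketch minimizing a toy objective:
import optuna
def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2  # minimum at x = 2
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)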
109. Ray:
- Distributed computing framework for parallel and distributed Python applications
- Code snippet: import ray
- Enables scaling and parallelizing Python applications easily
- Provides a simple API for implementing distributed algorithms
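A minimal sketch of parallelizing a function with remote tasks:
import ray
ray.init()
@ray.remote
def square(x):
    return x * x
futures = [square.remote(i) for i in range(4)]  # the calls run in parallel across workers
print(ray.get(futures))  # [0, 1, 4, 9]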
110. ONNX:
- Open Neural Network Exchange, a format for interchangeable AI models
- Code snippet: import onnx
- Allows models to be trained in one framework and run in another
- Supported by a wide range of frameworks, runtimes, and hardware platforms
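A minimal sketch of loading and validating a model, assuming an existing model.onnx file exported from some framework:
import onnx
model = onnx.load("model.onnx")
onnx.checker.check_model(model)  # validate the model structure
print(onnx.helper.printable_graph(model.graph))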
111. Scikit-image: Image processing library:
from skimage import io, filters
image = io.imread('image.png')
edges = filters.sobel(image)
112. Celery: A distributed task queue for Python:
from celery import Celery
app = Celery('tasks', broker='pyamqp://guest@localhost//')
@app.task
def add(x, y):
    return x + y
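Assuming a broker and a worker are running, a minimal sketch of calling the task above asynchronously:
result = add.delay(4, 4)       # enqueue the task
print(result.get(timeout=10))  # 8, once a worker has processed it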
113. TensorBoard: A visualization toolkit for TensorFlow (also usable from PyTorch, as here):
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
writer.add_scalar("Training loss", 0.25, global_step=1)  # log an illustrative scalar
writer.close()
114. Boto3: The Amazon Web Services (AWS) SDK for Python:
import boto3
s3 = boto3.resource("s3")
for bucket in s3.buckets.all():
    print(bucket.name)
115. Click: A library for creating beautiful command-line interfaces:
import click
@click.command()
@click.option('--count', default=1, help='Number of greetings.')
def hello(count):
    for _ in range(count):
        click.echo('Hello, World!')
if __name__ == '__main__':
    hello()
116. Keras-tuner:
A library for hyperparameter tuning of Keras models:
from kerastuner.tuners import RandomSearch
tuner = RandomSearch(
    build_model,  # a user-defined function that returns a compiled Keras model
    objective='val_loss',
    max_trials=5,
    executions_per_trial=3,
    directory='my_dir',
    project_name='helloworld')
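A minimal usage sketch; x_train, y_train, x_val, and y_val stand for user-supplied training and validation arrays:
tuner.search(x_train, y_train, epochs=5, validation_data=(x_val, y_val))
best_model = tuner.get_best_models(num_models=1)[0]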
117. Imbalanced-learn:
A Python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance.
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)  # X, y: an existing imbalanced dataset
118. Tqdm:
A fast, extensible progress bar for loops and other iterable objects.
from tqdm import tqdm
import time
for i in tqdm(range(10)):
    time.sleep(0.1)
119. Tabulate:
A library for pretty-printing tabular data.
from tabulate import tabulate
table = [["Alice", 24], ["Bob", 19]]
headers = ["Name", "Age"]
print(tabulate(table, headers=headers))
120. Pendulum:
A library to work with dates and times more easily.
import pendulum
now = pendulum.now()
print(now.to_date_string())
121. PuLP:
A linear programming library in Python.
from pulp import LpProblem, LpVariable, LpMaximize
prob = LpProblem("My Problem", LpMaximize)
x = LpVariable("x", 0, 4)
y = LpVariable("y", -1, 1)
prob += 2 * x + 3 * y  # illustrative objective: maximize 2x + 3y
prob += x + y <= 4     # an illustrative constraint
prob.solve()
print(x.value(), y.value())
122. Graphviz:
A library for creating, manipulating, and rendering graphs.
from graphviz import Digraph
g = Digraph('G', filename='hello.gv')
g.edge('Hello', 'World')
g.view()
123. PySpark:
The Python library for Spark, an open-source, distributed computing system that provides a high-level API for distributed data processing.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()
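A minimal sketch of creating and filtering a DataFrame with the session above (illustrative data):
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.filter(df.age > 40).show()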
124. Elasticsearch-py:
The official Elasticsearch client library for Python.
from elasticsearch import Elasticsearch
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
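A minimal indexing-and-search sketch using the 7.x-style client created above; the "articles" index name is illustrative:
doc = {"title": "Hello", "body": "An example document"}
es.index(index="articles", id=1, body=doc)
es.indices.refresh(index="articles")  # make the document searchable immediately
result = es.search(index="articles", body={"query": {"match": {"body": "example"}}})
print(result["hits"]["hits"][0]["_source"]["title"])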
125. SHAP:
A library to explain the output of any machine learning model using Shapley values.
import shap
explainer = shap.Explainer(model)
shap_values = explainer(X_test)
Conclusion:
Python’s vast array of modules and packages showcases its adaptability and versatility across a wide range of applications, including data analysis, machine learning, and web development. With over 350,000 packages at their disposal, developers can leverage Python’s extensive ecosystem to efficiently tackle complex problems and accelerate their projects. This article has provided an overview of some of the most powerful and widely used Python modules, demonstrating their potential in various domains.
References:
Official project sites and documentation for some of the modules mentioned in the article:
- Matplotlib: https://matplotlib.org/
- Seaborn: https://seaborn.pydata.org/
- Plotly: https://plotly.com/python/
- NumPy: https://numpy.org/
- Pandas: https://pandas.pydata.org/
- SciPy: https://www.scipy.org/
- Scikit-learn: https://scikit-learn.org/
- TensorFlow: https://www.tensorflow.org/
- Keras: https://keras.io/
- PyTorch: https://pytorch.org/
- NLTK: https://www.nltk.org/
- SpaCy: https://spacy.io/
- Gensim: https://radimrehurek.com/gensim/
- OpenCV: https://opencv.org/
- Requests: https://docs.python-requests.org/
- Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/
- Scrapy: https://scrapy.org/
- Flask: https://flask.palletsprojects.com/
- Django: https://www.djangoproject.com/
- SQLAlchemy: https://www.sqlalchemy.org/
- https://pypistats.org/top
- https://pypi.org/
- https://hugovk.github.io/top-pypi-packages/
- https://pypistats.org/
- Python Standard Library: https://docs.python.org/3/library/