
How to Get the Current Directory in Python
Anees Asghar
Technical writer
Python
04.02.2025
Reading time: 7 min

In Python, working with files and folders is a frequent task. A typical prerequisite is determining the current working directory (CWD), the path from which your Python code runs. Knowing how to fetch the CWD is essential for file management, since Python resolves relative file paths against this location. You may also need to identify the folder that holds the script itself, particularly when running programs that process files from different locations.

In this article, we'll look at several techniques for fetching the current directory in Python, with practical examples and the issues you may run into along the way.

What Does 'Current Working Directory' Mean?

It is the directory from which your Python code runs. All relative file paths within the script are resolved against this folder unless specified otherwise. Knowing how to locate and manage the CWD is essential, especially for tasks like reading or saving files.
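To see how the CWD anchors relative paths, here is a minimal sketch (the file name data.txt is purely illustrative):

```python
import os

# A relative path is resolved against the current working directory
relative_name = "data.txt"
absolute_name = os.path.abspath(relative_name)

# abspath() simply joins the CWD with the relative name
print(absolute_name)
assert absolute_name == os.path.join(os.getcwd(), relative_name)
```

The file does not need to exist; os.path.abspath() is a pure string operation.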

Fetching the Active Directory

Python offers numerous approaches to fetch the active directory. Let’s demonstrate each approach practically with its pros and cons:

Approach 1: Through os.getcwd()

This function offers the simplest way to fetch the current working directory: it returns the directory from which the script was launched. While this technique is straightforward and works well in most cases, it may not be suitable when running scripts from various locations, as it returns the CWD rather than the script's actual location. Its output can also differ across platforms because of differences in file path handling.

Let’s utilize the getcwd() function through the os module to fetch the active directory:

import os
print("CWD ⇒ ", os.getcwd())

On this machine, it prints C:\Users\HP\Documents as the CWD.
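Keep in mind that os.getcwd() reflects wherever the process currently is, not where the script file lives. A small sketch using os.chdir() and a temporary directory makes this visible:

```python
import os
import tempfile

original = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)          # move the process into the temporary directory
    moved_to = os.getcwd() # now reflects the new location, not the script's folder
    os.chdir(original)     # restore the CWD before the directory is cleaned up

print(os.path.realpath(moved_to) == os.path.realpath(tmp))  # True
print(os.getcwd() == original)                              # True
```

realpath() is used in the comparison because os.getcwd() returns a symlink-resolved path, which may differ textually from the temporary directory's name on some systems.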

Approach 2: Utilizing Path.cwd()

pathlib is a contemporary module that presents a structured, object-oriented approach to managing filesystem paths. The Path.cwd() function, available in pathlib, retrieves the current working directory as a Path object. This method is often considered clearer and more user-friendly than traditional os module functions. It also incorporates features for effortless path processing, making it a preferable option for controlling file paths in Python. However, since it yields a Path object, transforming it into a string could be required in certain situations.

To implement this function, commence by importing the Path class:

from pathlib import Path
print("CWD ⇒ ", Path.cwd())

Here, the cwd() class method of Path returns the current working directory as a Path object.
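Because Path.cwd() returns a Path object, you can build sub-paths with the / operator instead of string concatenation. A short sketch (the settings/config.ini names are illustrative):

```python
from pathlib import Path

cwd = Path.cwd()
config = cwd / "settings" / "config.ini"  # join path segments with /

print(config.name)    # "config.ini" — the final component
print(config.parent)  # the containing directory, .../settings
```

This composability is one of the main reasons pathlib is often preferred over raw string paths.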

Approach 3: Through sys.argv[0]

If we need to identify the folder where the Python script itself is located, rather than the current working directory, we can use sys.argv[0], which holds the path used to launch the script. Combining it with os.path.abspath() and os.path.dirname() yields the script's absolute directory. This guarantees a full path, which is particularly useful when processing files relative to the script itself instead of depending on the current working directory.

import os
import sys

# Directory containing the script, derived from the path used to launch it
script_directory = os.path.dirname(os.path.abspath(sys.argv[0]))
print("Script directory ⇒", script_directory)

In this instance, os.path.abspath() converts sys.argv[0] into a full path, and os.path.dirname() strips the file name, leaving the directory of the executing script.

Approach 4: Utilizing Inspect Module

The inspect module lets us fetch the directory of the running Python script by combining inspect.getfile(inspect.currentframe()) with os.path.abspath() and os.path.dirname(). This technique is especially helpful for identifying the script's precise location at runtime, making it valuable for troubleshooting or handling nested modules in larger frameworks. While it is more complicated than simpler alternatives like os.getcwd() or __file__, it offers higher accuracy in identifying the script's path. However, this approach carries minor performance overhead due to the additional function calls.

Let’s invoke the desired functions from their respective modules/classes to fetch the current script’s path:

import inspect
import os

# Path of the file that contains the currently executing frame
current_script_path = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
print("Script directory ⇒", current_script_path)

This code first fetches the script's file location through inspect.getfile(inspect.currentframe()), then converts it into an absolute path and derives the folder with os.path.dirname().

Approach 5: Through os.path.realpath()

os.path.realpath() resolves symlinks in a file path and returns the absolute, canonical location of the specified file. By combining it with the __file__ variable, we can determine the script's real path even if it has been symlinked from elsewhere. This makes it particularly useful in cases requiring precise file paths, such as loading resources that live next to the script.

However, it may not function appropriately in environments where __file__ is unavailable (e.g., certain interactive environments like IDLE), and its reliance on __file__ can sometimes confuse beginners. Additionally, while it resolves the script's location, it doesn’t directly retrieve CWD unless employed with other functions. Despite these limitations, it’s a dependable way to extract the exact location of a Python script.
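To see concretely what symlink resolution changes, here is a small sketch that creates a temporary file and a symlink to it (this assumes a platform where os.symlink is permitted, such as Linux or macOS):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    real_file = os.path.join(tmp, "real.txt")
    link = os.path.join(tmp, "link.txt")
    open(real_file, "w").close()      # create an empty target file
    os.symlink(real_file, link)       # link.txt -> real.txt

    # abspath() keeps the symlink name; realpath() follows it to the target
    print(os.path.abspath(link))   # ...link.txt
    print(os.path.realpath(link))  # ...real.txt
```

The same distinction applies to a script launched through a symlink: os.path.realpath(__file__) points at the script's real location, not the link.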

Let's apply dirname() to the __file__ variable and resolve the result with realpath():

import os
print(f"CWD: {os.path.realpath(os.path.dirname(__file__))}")

When running this code interactively, you might encounter a NameError: name '__file__' is not defined, since this variable is not always available in interactive environments. To avoid the issue, save the code as a .py file (e.g., exampleScript.py) and run it from the terminal.

Troubleshooting Typical Problems

You may encounter some challenges when using these techniques to fetch the current working directory (CWD) or the script's path in Python. Below are typical difficulties associated with each approach, along with their fixes:

os.getcwd()

It fetches the current working directory rather than the script's path, which can cause confusion when running scripts from different folders.

Fix: Use it only when the CWD is what you actually need. To fetch the script's location, use alternatives such as os.path.realpath() or sys.argv[0].

Path.cwd()

It returns a Path object rather than a string, which might require conversion for compatibility with certain functions.

Fix: Convert the Path object to a string employing str(Path.cwd()) when needed.
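A quick sketch of that conversion; os.fspath() works as well and is the more general form, since it accepts any path-like object:

```python
import os
from pathlib import Path

cwd = Path.cwd()

as_string = str(cwd)          # plain string, accepted everywhere
also_string = os.fspath(cwd)  # equivalent for Path objects

print(as_string == also_string)  # True
```

In practice, most standard-library functions accept Path objects directly (PEP 519), so the conversion is only needed for older or third-party APIs that insist on strings.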

sys.argv[0]

It gives the script’s path but may not function correctly if the script is run indirectly or if the path changes during execution.

Fix: Run the script directly and always wrap sys.argv[0] in os.path.abspath() to obtain the complete path.

inspect Module

It is more complex and may introduce minor performance overhead due to additional function calls.

Fix: Employ this approach in advanced scenarios where runtime accuracy is critical, such as debugging or handling nested modules.

os.path.realpath()

It relies on the __file__ variable, which is unavailable in certain interactive environments such as Jupyter Notebook or IDLE.

Fix: Run the code from a .py file in the terminal so that __file__ is defined. In interactive environments, fall back to os.getcwd() if the script's path is not strictly necessary.
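One way to make a snippet robust in both situations is to catch the NameError and fall back to the CWD. A hedged sketch (the helper name get_base_dir is our own, not a standard function):

```python
import os

def get_base_dir():
    """Return the script's directory, or the CWD when __file__ is undefined."""
    try:
        return os.path.realpath(os.path.dirname(__file__))
    except NameError:  # interactive shells may not define __file__
        return os.getcwd()

print(get_base_dir())
```

When run as a saved script, this returns the script's folder; pasted into an interactive session, it silently degrades to the current working directory.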

Final Thoughts

In this article, we demonstrated several methods for locating the current working directory (CWD) in Python. We examined os.getcwd(), Path.cwd(), sys.argv[0], the inspect module, and os.path.realpath(), highlighting their benefits and appropriate use cases. Each method suits a different situation, such as fetching the CWD versus finding where a script is stored. We also covered the common problems you might face with these techniques and simple fixes for them. With these tools, you can manage file paths and directories in Python with confidence.


Similar

Python

How to Create and Set Up a Telegram Chatbot

Chatbots are software programs that simulate communication with users. Today, we use them for a wide range of purposes, from simple directories to complex services integrated with CRM systems and payment platforms. People create bots for Telegram, Viber, Facebook Messenger, and other messaging platforms. Each platform has its own rules and capabilities—some lack payment integration, while others don't support flexible keyboards. This article focuses on user-friendly Telegram, which has a simple API and an active audience. In this article, we will cover: How to create a Telegram bot on your own When it's convenient to use chatbot builders for development How to integrate a chatbot with external services and APIs What is needed for the bot to function smoothly The key features of Aiogram, a popular Python library for chatbot development Creating a Telegram Chatbot Without Programming Skills Chatbot builders are becoming increasingly popular. These services allow you to create a bot using a simple "drag-and-drop" interface. No programming knowledge is required—you just build logic blocks like in a children's game. However, there are some drawbacks to using chatbot builders: Limited functionality. Most chatbot builders provide only a portion of Telegram API's capabilities. For example, not all of them allow integration with third-party services via HTTP requests. Those that do often have expensive pricing plans. Generic scenarios. The minimal flexibility of builders leads to chatbots that look and function similarly. Dependence on the service. If the platform goes offline or its pricing increases, you may have to migrate your bot elsewhere. Builders are useful for prototyping and simple use cases—such as a welcome message, answering a few questions, or collecting contact information. However, more complex algorithms require knowledge of variables, data processing logic, and the Telegram API. 
Even when using a builder, you still need to understand how to address users by name, how inline keyboards work, and how to handle bot states. Free versions of chatbot builders often come with limitations: They may include advertising messages. Some prevent integration with essential APIs. Others impose limits on the number of users. These restrictions can reduce audience engagement, making the chatbot ineffective. In the long run, premium versions of these builders can end up costing more than developing a bot from scratch and hosting it on your own server. If you need a chatbot to handle real business tasks, automate processes, or work with databases, builders are often not sufficient. In such cases, hiring a developer is a better solution. A developer can design a flexible architecture, choose optimal technologies, and eliminate technical constraints that might hinder the project's scalability. If you already have a prototype built with a chatbot builder, you can use its logic as a starting point for technical specifications. How to Create a Telegram Chatbot Now, let's discuss how to create a Telegram chatbot using Python. You’ll need basic knowledge of variables, conditional statements, loops, and functions in Python. To create chatbots, you can use a framework which is a set of tools, libraries, and ready-made solutions that simplify software development. You can work with the raw Telegram API and implement functionality using HTTP requests, but even for simple tasks, this approach requires writing thousands of lines of code. In this guide, we’ll use Aiogram, one of the most popular frameworks for building Telegram chatbots in Python. Step 1: Create a Virtual Environment for Your Project Using a virtual environment in any Python project is considered good practice. Additionally, chatbots are often deployed on cloud servers where dependencies need to be installed. A virtual environment makes it easy to export a list of dependencies specific to your project. 
Install the Python virtual environment: sudo apt install python3-venv -y Create a virtual Python environment in the working directory: python -m venv venv Activate the environment: source ./venv/bin/activate Step 2: Install Required Libraries Install the Aiogram framework using pip: pip install aiogram Add a library for working with environment variables. We recommend this method for handling tokens in any project, even if you don’t plan to make it public. This reduces the risk of accidentally exposing confidential data. pip install python-dotenv You can also install any other dependencies as needed. Step 3: Initialize Your Chatbot via BotFather This is a simple step, but it often causes confusion. We need to interact with a Telegram bot that will generate and provide us with a token for our project. Open Telegram and start a chat with @BotFather. Click the Start button. The bot will send a welcome message. Enter the following command: /newbot BotFather will ask for a name for your bot—this is what users will see in their chat list. Then, enter a username for your bot. It must be unique and end with "bot" (e.g., mycoolbot). Once completed, BotFather will create your chatbot, assign it a username, and provide you with a token. Keep your token secret. Anyone with access to it can send messages on behalf of your chatbot. If your token is compromised, immediately generate a new one via BotFather. Next, open a chat with your newly created bot and configure the following: Click the Edit button. Update the profile picture. Set a welcome message. Add a description. Configure default commands. Step 4: Store Your Token Securely Create an environment file named .env (this file has no name, only an extension). 
Add the following line: BOT_TOKEN = your_generated_token On Linux and macOS, you can quickly save the token using the following command: echo "BOT_TOKEN = your_generated_token" > .env Step 4: Create the Script In your working directory, create a file called main.py—this will be the main script for your chatbot. Now, import the following test code, which will send a welcome message to the user when they enter the /start command: import asyncio # Library for handling asynchronous code import os # Module for working with environment variables from dotenv import load_dotenv # Function to load environment variables from the .env file from aiogram import Bot, Dispatcher, Router # Import necessary classes from aiogram from aiogram.types import Message # Import Message class for handling incoming messages from aiogram.filters import CommandStart # Import filter for handling the /start command # Create a router to store message handlers router = Router() # Load environment variables from .env load_dotenv() # Handler for the /start command @router.message(CommandStart()) # Filter to check if the message is the /start command async def cmd_start(message: Message) -> None: # Retrieve the user's first name and last name (if available) first_name = message.from_user.first_name last_name = message.from_user.last_name or "" # If no last name, use an empty string # Send a welcome message to the user await message.answer(f"Hello, {first_name} {last_name}!") # Main asynchronous function to start the bot async def main(): # Create a bot instance using the token from environment variables bot = Bot(token=os.getenv("BOT_TOKEN")) # Create a dispatcher to handle messages dp = Dispatcher() # Include the router with command handlers dp.include_router(router) # Start the bot in polling mode await dp.start_polling(bot) # If the script is run directly (not imported as a module), # execute the main() function if __name__ == "__main__": asyncio.run(main()) The script is well-commented to help 
you understand the essential parts.If you don't want to dive deep, you can simply use Dispatcher and Router as standard components in Aiogram. We will explore their functionality later in this guide. This ready-made structure can serve as a solid starting point for any chatbot project. As you continue development, you will add more handlers, keyboards, and states. Step 5: Run and Test the Chatbot Now, launch your script using the following command: python main.py Now you can open a chat with your bot in Telegram and start interacting with it. Aiogram Framework v3.x Features Overview  You only need to understand a few key components and functions of Aiogram to create a Telegram chatbot. This section covers Aiogram v3.x, which was released on September 1, 2023. Any version starting with 3.x will work. While older projects using Aiogram 2.x still exist, version 2.x is now considered outdated. Key Components of Aiogram Bot The Bot class serves as the interface to the Telegram API. It allows you to send messages, images, and other data to users. bot = Bot(token=os.getenv("TOKEN")) You can pass the token directly when initializing the Bot class, but it's recommended to use environment variables to prevent accidental exposure of your bot token. Dispatcher The Dispatcher is the core of the framework. It receives updates (incoming messages and events) and routes them to the appropriate handlers. dp = Dispatcher() In Aiogram v3, a new structure with Router is used (see below), but the Dispatcher is still required for initialization and launching the bot. Router In Aiogram v3, handlers are grouped within a Router. This is a separate entity that stores the bot's logic—command handlers, message handlers, callback handlers, and more. from aiogram import Router router = Router() After defining handlers inside the router, developers register it with the Dispatcher: dp.include_router(router) Handling Commands The most common scenario is responding to commands like /start or /help. 
from aiogram import F from aiogram.types import Message @router.message(F.text == "/start") async def cmd_start(message: Message): await message.answer("Hello! I'm a bot running on Aiogram.") F.text == "/start" is a new filtering method in Aiogram v3. message.answer(...) sends a reply to the user. Handling Regular Messages To react to any message, simply remove the filter or define a different condition: @router.message() async def echo_all(message: Message): await message.answer(f"You wrote: {message.text}") In this example, the bot echoes whatever text the user sends. Inline Buttons and Keyboards from aiogram.types import InlineKeyboardButton, InlineKeyboardMarkup inline_kb = InlineKeyboardMarkup( inline_keyboard=[ [InlineKeyboardButton(text="Click me!", callback_data="press_button")] ] ) @router.message(F.text == "/buttons") async def show_buttons(message: Message): await message.answer("Here are my buttons:", reply_markup=inline_kb) When the user clicks the button, the bot receives callback_data="press_button", which can be handled separately: from aiogram.types import CallbackQuery @router.callback_query(F.data == "press_button") async def handle_press_button(callback: CallbackQuery): await callback.message.answer("You clicked the button!") await callback.answer() # Removes the "loading" animation in the chat Regular Buttons (Reply Keyboard) Regular buttons differ from inline buttons in that they replace the keyboard. The user immediately sees a list of available response options. These buttons are tracked by the message text, not callback_data. 
from aiogram.types import ReplyKeyboardMarkup, KeyboardButton, ReplyKeyboardRemove # Creating a reply keyboard reply_kb = ReplyKeyboardMarkup( keyboard=[ [ KeyboardButton(text="View Menu"), KeyboardButton(text="Place Order") ] ], resize_keyboard=True # Automatically adjusts button size ) # Handling the /start command and showing the reply keyboard @router.message(F.text == "/start") async def start_cmd(message: Message): await message.answer( "Welcome! Choose an action:", reply_markup=reply_kb ) # Handling "View Menu" button press @router.message(F.text == "View Menu") async def show_menu(message: Message): await message.answer("We have pizza and drinks.") # Handling "Place Order" button press @router.message(F.text == "Place Order") async def make_order(message: Message): await message.answer("What would you like to order?") # Command to hide the keyboard @router.message(F.text == "/hide") async def hide_keyboard(message: Message): await message.answer("Hiding the keyboard", reply_markup=ReplyKeyboardRemove()) Filters and Middlewares Filters Filters help define which messages should be processed. You can also create custom filters. from aiogram.filters import Filter # Custom filter to check if a user is an admin class IsAdmin(Filter): def __init__(self, admin_id: int): self.admin_id = admin_id async def __call__(self, message: Message) -> bool: return message.from_user.id == self.admin_id # Using the filter to restrict a command to the admin @router.message(IsAdmin(admin_id=12345678), F.text == "/admin") async def admin_cmd(message: Message): await message.answer("Hello, Admin! You have special privileges.") Middlewares Middlewares act as intermediary layers between an incoming request and its handler. You can use them to intercept, modify, validate, or log messages before they reach their respective handlers. 
import logging from aiogram.types import CallbackQuery, Message from aiogram.dispatcher.middlewares.base import BaseMiddleware # Custom middleware to log incoming messages and callbacks class LoggingMiddleware(BaseMiddleware): async def __call__(self, handler, event, data): if isinstance(event, Message): logging.info(f"[Message] from {event.from_user.id}: {event.text}") elif isinstance(event, CallbackQuery): logging.info(f"[CallbackQuery] from {event.from_user.id}: {event.data}") # Pass the event to the next handler return await handler(event, data) async def main(): load_dotenv() logging.basicConfig(level=logging.INFO) bot = Bot(token=os.getenv("BOT_TOKEN")) dp = Dispatcher() # Attaching the middleware dp.update.middleware(LoggingMiddleware()) dp.include_router(router) await dp.start_polling(bot) Working with States (FSM) in Aiogram 3 Aiogram 3 supports Finite State Machine (FSM), which is useful for step-by-step data collection (e.g., user registration, order processing). FSM is crucial for implementing multi-step workflows where users must complete one step before moving to the next. For example, in a pizza ordering bot, we need to ask the user for pizza size and delivery address, ensuring the process is sequential. We must save each step's data until the order is complete. Step 1: Declare States from aiogram.fsm.state import State, StatesGroup class OrderPizza(StatesGroup): waiting_for_size = State() waiting_for_address = State() These states define different stages in the ordering process. 
Step 2: Switch between states from aiogram.fsm.context import FSMContext @router.message(F.text == "/order") async def cmd_order(message: Message, state: FSMContext): # Create inline buttons for selecting pizza size size_keyboard = InlineKeyboardMarkup( inline_keyboard=[ [ InlineKeyboardButton(text="Large", callback_data="size_big"), InlineKeyboardButton(text="Medium", callback_data="size_medium"), InlineKeyboardButton(text="Small", callback_data="size_small") ] ] ) await message.answer( "What size pizza would you like? Click one of the buttons:", reply_markup=size_keyboard ) # Set the state to wait for the user to choose a size await state.set_state(OrderPizza.waiting_for_size) # Step 2: Handle button click for size selection @router.callback_query(OrderPizza.waiting_for_size, F.data.startswith("size_")) async def choose_size_callback(callback: CallbackQuery, state: FSMContext): # Callback data can be size_big / size_medium / size_small size_data = callback.data.split("_")[1] # e.g., "big", "medium", or "small" # Save the selected pizza size in the temporary state storage await state.update_data(pizza_size=size_data) # Confirm the button press (removes "loading clock" in Telegram's UI) await callback.answer() await callback.message.answer("Please enter your delivery address:") await state.set_state(OrderPizza.waiting_for_address) # Step 2a: If the user sends a message instead of clicking a button (in waiting_for_size state), # we can handle it separately. For example, prompt them to use the buttons. @router.message(OrderPizza.waiting_for_size) async def handle_text_during_waiting_for_size(message: Message, state: FSMContext): await message.answer( "Please select a pizza size using the buttons above. " "We cannot proceed without this information." 
) # Step 3: User sends the delivery address @router.message(OrderPizza.waiting_for_address) async def set_address(message: Message, state: FSMContext): address = message.text user_data = await state.get_data() pizza_size = user_data["pizza_size"] size_text = { "big": "large", "medium": "medium", "small": "small" }.get(pizza_size, "undefined") await message.answer(f"You have ordered a {size_text} pizza to be delivered at: {address}") # Clear the state — the process is complete await state.clear() Notice how the temporary storage keeps track of user responses at each step. This storage is user-specific and does not require a database. The user progresses through a chain of questions, and at the end, the order details can be sent to an internal API.  Deploying the Bot: Running on a Server Let's go through two main deployment methods. Quick Method: Docker + Hostman App Platform This method does not require any system administration knowledge; the entire deployment process is automated. Additionally, it helps save costs. Follow these steps: Export all project dependencies to a requirements.txt file. Using a virtual environment is recommended to avoid pulling in libraries from the entire system. Run the following command in the project directory terminal: pip freeze > requirements.txt Add a deployment file to the project directory — Dockerfile. This file has no extension, just the name. Insert the following content: FROM python:3.11 WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . EXPOSE 9999 CMD ["python", "main.py"] Create a Git repository and push it to GitHub. You can use a minimal set of Git commands from our guide by running these commands in sequence. Add the environment variables file (.env) to .gitignore to prevent it from being exposed publicly. Go to the Hostman control panel, select the App platform section, and click Create app. Go to the Docker tab and select Dockerfile. 
Link your GitHub account or connect your Git repository via URL. Select the repository from the list after linking your GitHub account. Choose a configuration. Hostman Apps offers a configuration of 1 CPU x 3.3GHz, 1GB RAM, NVMe storage, which is ideal for simple text-based bots, projects with small inline keyboards, basic FSM logic, low-demand API requests, working with SQLite, or lightweight JSON files. This configuration can handle 50-100 users per minute. Add the bot token to environment variables. In the App settings, click + Add, enter BOT_TOKEN as the key, and paste the token obtained from BotFather as the value. Start the deployment and wait for it to complete. Once finished, the bot will be up and running. Standard Method: Ubuntu + systemd Export all project dependencies to the requirements.txt file. Run the following command in the Terminal while in the project directory: pip freeze > requirements.txt Create a cloud server in the Hostman panel with the desired configuration and Ubuntu OS. Transfer project files to the directory on the remote server. The easiest way to do this is using the rsync utility if you're using Ubuntu/MacOS: rsync -av --exclude="venv" --exclude=".idea" --exclude=".git" ./ root@176.53.160.13:/root/project Don’t forget to replace the server IP and correct the destination directory.  Windows users can use FileZilla to transfer files.  Connect to the server via SSH. Install the package for virtual environments: sudo apt install python3.10-venv Navigate to the project directory where you transferred the files. Create a virtual environment and install the dependencies: python -m venv venvsource venv/bin/activatepip install -r requirements.txt Test the bot functionality by running it: python main.py If everything works, proceed to the next step. 
Create the unit file /etc/systemd/system/telegram-bot.service: sudo nano /etc/systemd/system/telegram-bot.service Add the following content to the file: [Unit] Description=Telegram Bot Service After=network.target [Service] User=root WorkingDirectory=/root/project ExecStart=/root/proj/venv/bin/python /root/proj/main.py Restart=always RestartSec=5 [Install] WantedBy=multi-user.target WorkingDirectory — the project directory ExecStart — the command to start the chatbot in the format <interpreter> <full path to the file>. If using a virtual environment, the path to the interpreter will be as in the example. If working without venv, use /usr/local/bin/python3. Reload systemd and enable the service: sudo systemctl daemon-reloadsudo systemctl enable telegram-bot.servicesudo systemctl start telegram-bot.service Check the status of the service and view logs if necessary: sudo systemctl status telegram-bot.service If the bot is running correctly, the Active field should show active (running). View bot logs: sudo journalctl -u telegram-bot.service -f Manage the service with the following commands: Restart the bot: sudo systemctl restart telegram-bot.service Stop the bot: sudo systemctl stop telegram-bot.service Remove the service (if needed): sudo systemctl disable telegram-bot.servicesudo rm /etc/systemd/system/telegram-bot.servicesudo systemctl daemon-reload Conclusion Creating a Telegram chatbot in Python is a task that can be accomplished even without programming experience using bot builders. However, if you need flexibility and more options, it's better to master the aiogram framework and deploy your own project. This gives you full control over the code, the ability to enhance functionality, manage integrations, and avoid the limitations of paid plans. To run the bot in production, simply choose an appropriate configuration on the Hostman App Platform and set up automatic deployment. 
Pay attention to security by storing the token in an environment variable and encrypting sensitive data. In the future, you can scale the bot, add webhook support, integrate payment systems and analytics systems, and work with ML models if AI features are required.
12 March 2025 · 18 min to read
Microservices

Database Connection in Python, Go, and JavaScript

Databases are an essential part of almost any project today. Database interactions are especially familiar to system and database administrators, DevOps/SRE professionals, and software developers. While administrators typically deploy one or multiple database instances and configure the necessary connection parameters for applications, developers need to connect directly to the database within their code. This article explores how to connect to databases using different programming languages. Prerequisites We will provide examples for connecting to MySQL, PostgreSQL, Redis, MongoDB, and ClickHouse databases using Python, Go, and JavaScript. To follow this guide, you will need: A database deployed on a server or in the cloud. Installed environments for Python, Go, and JavaScript, depending on your application programming language. Additionally for Python: pip installed. Additionally for JavaScript: Node.js and npm installed. Database Connection in Python MySQL and Python For connecting to MySQL databases, we can use a Python driver called MySQL Connector. Install the driver using pip: pip install mysql-connector-python Initialize a new connection: Import the mysql.connector library and the Error class to handle specific connection errors. Create a function named create_connection, passing the database address (host), user name (user), and user password (password). To establish the connection, define a class called create_connection that receives the variable names containing the database connection details. 
import mysql.connector
from mysql.connector import Error

def create_connection(host_name, user_name, user_password):
    connection = None
    try:
        connection = mysql.connector.connect(
            host=host_name,
            user=user_name,
            password=user_password
        )
        print("Successfully connected to MySQL Server!")
    except Error as e:
        print(f"The error '{e}' occurred")
    return connection

def execute_query(connection, query):
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        connection.commit()
        print("Query executed successfully")
    except Error as e:
        print(f"The error '{e}' occurred")

connection = create_connection("91.206.179.29", "gen_user", "m-EE6Wm}z@wCKe")

Run the script. If everything works correctly, you will see the "Successfully connected to MySQL Server!" message. If any errors occur, the console will display the error code and description.

Create a new table:

Connect to the database by setting the connection.database attribute to the name of the database. Note that the database should already exist.
To create a table, initialize a variable create_table_query containing the SQL CREATE TABLE query.
For data insertion, initialize another variable insert_data_query with the SQL INSERT INTO query.
To execute each query, use the execute_query function, which takes the connection object and the variable containing the SQL query.

connection.database = 'test_db'

create_table_query = """
CREATE TABLE IF NOT EXISTS users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    age INT NOT NULL
)
"""
execute_query(connection, create_table_query)

insert_data_query = """
INSERT INTO users (name, age)
VALUES ('Alice', 30), ('Bob', 25)
"""
execute_query(connection, insert_data_query)

if connection.is_connected():
    connection.close()
    print("Connection closed")

Run the script.

PostgreSQL and Python

Python offers several plugins for connecting to PostgreSQL, but the most popular one is psycopg2, which we will use here.
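MySQL Connector follows the Python DB-API 2.0, so values should be passed as query parameters rather than interpolated into the SQL string. The same pattern can be illustrated without a running MySQL server using the stdlib sqlite3 module, which implements the same DB-API (sqlite3 uses ? placeholders where MySQL Connector uses %s):

```python
import sqlite3

# An in-memory SQLite database stands in for a real MySQL connection here
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

cursor.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# Parameterized insert: the driver escapes the values, preventing SQL injection
cursor.execute("INSERT INTO users (name, age) VALUES (?, ?)", ("Alice", 30))
connection.commit()

cursor.execute("SELECT name, age FROM users WHERE age > ?", (25,))
rows = cursor.fetchall()
connection.close()
```

With MySQL Connector the calls look the same, only with %s placeholders, e.g. cursor.execute("INSERT INTO users (name, age) VALUES (%s, %s)", ("Alice", 30)).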
Psycopg2 is one of the most frequently used Python libraries for PostgreSQL connections. One of its key advantages is support for multithreading, which allows you to maintain the database connection across multiple threads.

Install psycopg2 using pip (if not already installed):

pip install psycopg2-binary

Connect to PostgreSQL. Import the psycopg2 package and create a function create_new_conn that uses a try block. Establish the connection with the psycopg2.connect function, which takes the database name (dbname), user name (user), password (password), and database address (host) as input. To initialize the connection, call the create_new_conn() function. Here's the full code example for connecting to a database:

import psycopg2
from psycopg2 import OperationalError

def create_new_conn():
    conn_to_postgres = None
    while not conn_to_postgres:
        try:
            conn_to_postgres = psycopg2.connect(
                dbname="default_db",
                user="gen_user",
                password="PasswordForDefautUser9893#",
                host="91.206.179.128"
            )
            print("The connection to PostgreSQL has been successfully established!")
        except OperationalError as e:
            print(e)
    return conn_to_postgres

conn_to_postgres = create_new_conn()

Run the script:

python3 connect_to_postgres.py

If successful, you will see the "The connection to PostgreSQL has been successfully established!" message.

Next, create a table named books with three columns. Use the cursor class for SQL statements, such as creating database objects. If the query adds or modifies data, you must call the conn_to_postgres.commit() function afterward to apply the changes.
import psycopg2
from psycopg2 import OperationalError

def create_new_conn():
    conn_to_postgres = None
    while not conn_to_postgres:
        try:
            conn_to_postgres = psycopg2.connect(
                dbname="default_db",
                user="gen_user",
                password="PasswordForDefautUser9893#",
                host="91.206.179.128"
            )
        except OperationalError as e:
            print(e)
    return conn_to_postgres

conn_to_postgres = create_new_conn()

cursor = conn_to_postgres.cursor()
cursor.execute("""
    CREATE TABLE books (
        book_id INT PRIMARY KEY NOT NULL,
        book_name VARCHAR(255) NOT NULL,
        book_author VARCHAR(255) NOT NULL
    )
""")
conn_to_postgres.commit()
print("Table created successfully")

Run the script:

python3 create_table.py

Now, let's run INSERT INTO to add a new row:

cursor.execute("""
    INSERT INTO books (book_id, book_name, book_author)
    VALUES (1, 'Long Walk to Freedom', 'Nelson_Mandela')
""")

The full code is below:

import psycopg2
from psycopg2 import OperationalError

def create_new_conn():
    conn_to_postgres = None
    while not conn_to_postgres:
        try:
            conn_to_postgres = psycopg2.connect(
                dbname="default_db",
                user="gen_user",
                password="PasswordForDefautUser9893#",
                host="91.206.179.128"
            )
        except OperationalError as e:
            print(e)
    return conn_to_postgres

conn_to_postgres = create_new_conn()

cursor = conn_to_postgres.cursor()
cursor.execute("""
    INSERT INTO books (book_id, book_name, book_author)
    VALUES (1, 'Long Walk to Freedom', 'Nelson_Mandela')
""")
conn_to_postgres.commit()
conn_to_postgres.close()
print("Data inserted successfully")

Run the script:

python3 insert-data.py

Redis and Python

Redis belongs to the class of NoSQL databases where data is stored in memory rather than on hard drives. It uses a key-value format for data storage. Redis has a wide range of applications, from data storage and caching to serving as a message broker. We will use the redis-py (or simply redis) library for connecting to Redis.
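Instead of keyword arguments, psycopg2.connect also accepts a single libpq-style DSN string of space-separated key=value pairs. A small helper for composing one (illustrative sketch; the naive formatting here only covers values without spaces or quotes):

```python
def build_pg_dsn(dbname: str, user: str, password: str, host: str, port: int = 5432) -> str:
    # Compose a libpq-style DSN string: "dbname=... user=... password=... host=... port=..."
    parts = {
        "dbname": dbname,
        "user": user,
        "password": password,
        "host": host,
        "port": port,
    }
    return " ".join(f"{key}={value}" for key, value in parts.items())
```

Usage would look like psycopg2.connect(build_pg_dsn("default_db", "gen_user", "secret", "91.206.179.128")), which is equivalent to passing the same values as keyword arguments.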
Install the Redis library using pip:

pip install redis

Connect to a Redis instance. Use a try block for the connection and the redis.StrictRedis class, where you provide the Redis address (host), port (port), and user password (password):

import redis

try:
    connect_to_redis_server = redis.StrictRedis(
        host='91.206.179.128',
        port=6379,
        password='PasswordForRedis6379')
    connect_to_redis_server.ping()
    print('Successfully connected to Redis Server!')
except Exception as ex:
    print('Error:', ex)
    exit('Failed to connect to Redis server.')

Run the script:

python3 connect_to_redis.py

If successful, you will see a message like "Successfully connected to Redis Server!".

Unlike relational databases, Redis stores data in a key-value format. The key uniquely identifies the corresponding value.

Use the set method to create a new record. The example below creates a record with the key City and the value Berlin:

print('Create new record:', connect_to_redis_server.set("City", "Berlin"))

Use the get method to retrieve the value associated with a key:

print('Print record using record key:', connect_to_redis_server.get("City"))

Use the delete method to remove a record by its key:

print('Delete record with key:', connect_to_redis_server.delete("City"))

The complete code fragment is below:

import redis

try:
    connect_to_redis_server = redis.StrictRedis(
        host='91.206.179.128',
        port=6379,
        password='PasswordForRedis6379')
    print('New record created:', connect_to_redis_server.set("City", "Berlin"))
    print('Print created record using record key:', connect_to_redis_server.get("City"))
    print('Delete created record with key:', connect_to_redis_server.delete("City"))
except Exception as ex:
    print('Error:', ex)

MongoDB and Python

MongoDB is another widely used NoSQL database that belongs to the document-oriented category. Data is organized as JSON-like documents.
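Because Redis keys live in a single flat namespace, a common convention is to build colon-delimited keys such as user:42:city. A tiny helper for that (illustrative only; this function is not part of redis-py):

```python
def make_key(*parts) -> str:
    # Join key segments with ':' per the common Redis key-naming convention
    return ":".join(str(part) for part in parts)
```

With the connection above, connect_to_redis_server.set(make_key("user", 42, "city"), "Berlin") would store the value under the key user:42:city, keeping related records easy to scan and group.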
To connect to a MongoDB database with Python, the recommended library is PyMongo, which provides a synchronous API.

Install the PyMongo plugin:

pip3 install pymongo

Connect to the MongoDB server using the following Python code. Import the pymongo module and use the MongoClient class to specify the database server address. To verify the connection to the MongoDB server, use a try block for error handling:

import pymongo

connect_to_mongo = pymongo.MongoClient("mongodb://91.206.179.29:27017/")
first_db = connect_to_mongo["mongo-db1"]

try:
    first_db.command("serverStatus")
except Exception as e:
    print(e)
else:
    print("Successfully connected to MongoDB Server!")

connect_to_mongo.close()

Run:

python3 connect_mongodb.py

If the connection is successfully established, the script will return the message: "Successfully connected to MongoDB Server!"

Add data to MongoDB. To add data, you need to create a dictionary. Let's create a dictionary named record1, containing three keys:

record1 = {
    "name": "Alex",
    "age": 25,
    "location": "London"
}

To insert the dictionary data, use the insert_one method:

insertrecord = collection1.insert_one(record1)

The full code is below:

import pymongo

connect_to_mongo = pymongo.MongoClient("mongodb://91.206.179.29:27017/")
db1 = connect_to_mongo["newdb"]
collection1 = db1["userdata"]

record1 = {
    "name": "Alex",
    "age": 25,
    "location": "London"
}

insertrecord = collection1.insert_one(record1)
print(insertrecord)

Run the script:

python3 connect_mongodb.py

ClickHouse and Python

ClickHouse is a columnar NoSQL database where data is stored in columns rather than rows. It is widely used for handling analytical queries.

Install the ClickHouse driver for Python. There is a dedicated plugin for ClickHouse called clickhouse-driver. Install the driver using the pip package manager:

pip install clickhouse-driver

Connect to ClickHouse. To initialize a connection with ClickHouse, you need to import the Client class from the clickhouse_driver library.
To execute SQL queries, use the client.execute function. You also need to specify the engine. For more details on supported engines in ClickHouse, refer to the official documentation. We'll use the default engine, MergeTree.

Next, create a new table called Users and insert two columns with data. To list the data to be added to the table, use the tuple data type. After executing the necessary queries, close the connection to the database using the client.disconnect() method. The final code will look like this:

from clickhouse_driver import Client

client = Client(host='91.206.179.128', user='root', password='P@$$w0rd123', port=9000)

client.execute('''
    CREATE TABLE IF NOT EXISTS Users (
        id UInt32,
        name String
    ) ENGINE = MergeTree()
    ORDER BY id
''')

data = [
    (1, 'Alice'),
    (2, 'Mary')
]
client.execute('INSERT INTO Users (id, name) VALUES', data)

result = client.execute('SELECT * FROM Users')
for row in result:
    print(row)

client.disconnect()

Database Connection in Go

Go is one of the youngest programming languages, developed in 2009 by Google. It is widely used in developing microservice architectures and network utilities. For example, services like Docker and Kubernetes are written in Go. Go supports all popular databases, including PostgreSQL, Redis, MongoDB, MySQL, ClickHouse, etc.

MySQL and Go

For working with MySQL databases in Go, use the go-sql-driver/mysql driver.

Create a new directory for storing project files and navigate into it:

mkdir mysql-connect && cd mysql-connect

Create a go.mod file to store the dependencies:

go mod init golang-connect-mysql

Download the MySQL driver using the go get command:

go get -u github.com/go-sql-driver/mysql

Create a new file named main.go.
Specify the database connection details in the dsn variable:

package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    dsn := "root:password@tcp(localhost:3306)/testdb"
    db, err := sql.Open("mysql", dsn)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    if err := db.Ping(); err != nil {
        log.Fatal(err)
    }
    fmt.Println("Successfully connected to the database!")

    query := "INSERT INTO users (name, age) VALUES (?, ?)"
    result, err := db.Exec(query, "Alex", 25)
    if err != nil {
        log.Fatal(err)
    }

    lastInsertID, err := result.LastInsertId()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Inserted data with ID: %d\n", lastInsertID)
}

PostgreSQL and Go

To connect to PostgreSQL, use the pq driver. Before installing the driver, let's prepare our environment.

Create a new directory for storing the project files and navigate into it:

mkdir postgres-connect && cd postgres-connect

Since we will be working with dependencies, we need to create a go.mod file to store them:

go mod init golang-connect-postgres

Download the pq driver using the go get command:

go get github.com/lib/pq

Create a new file named main.go.

In addition to importing the pq library, you must add the database/sql library, as Go does not come with official database drivers by default. The database/sql library consists of general, driver-independent interfaces for working with databases. Also note the underscore (blank identifier) when importing the pq module:

_ "github.com/lib/pq"

The blank identifier is used to avoid the "unused import" error; in this case, we only need the driver to register itself with database/sql. The fmt package is required to output data to the standard output stream, for example, to the console. To open a connection to the database, use the sql.Open function, which takes the connection string (connStr) and the driver name (postgres).
The connection string specifies the username, database name, password, and host address:

package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/lib/pq"
)

func main() {
    connStr := "user=golang dbname=db_for_golang password=Golanguserfordb0206$ host=47.45.249.146 sslmode=disable"

    db, err := sql.Open("postgres", connStr)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    err = db.Ping()
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println("Successfully connected to PostgreSQL!")
}

Compile and run:

go run main.go

If everything works correctly, the terminal will display the message: Successfully connected to PostgreSQL!

Now, let's look at an example of how to insert data into a table. First, we need to create a table in the database. When using Hostman cloud databases, you can copy the PostgreSQL connection string displayed in the "Connections" section of the Hostman web interface. Make sure that the postgresql-client utility is installed on your device beforehand.

Enter the psql shell and connect to the previously created database:

\c db_for_golang

Create a table named Cities with three fields — city_id, city_name, and city_population:

CREATE TABLE Cities (
    city_id INT PRIMARY KEY,
    city_name VARCHAR(45) NOT NULL,
    city_population INT NOT NULL);

Grant full privileges on the created table to the user:

GRANT ALL PRIVILEGES ON TABLE cities TO golang;

The db.Prepare function is used to prepare a statement: it specifies the insert query in advance. To insert data, use the stmt.Exec function. In Go, it's common to use plain SQL rather than an ORM (Object-Relational Mapping) approach.

stmt, err := db.Prepare("INSERT INTO Cities(city_id, city_name, city_population) VALUES($1, $2, $3)")
if err != nil {
    log.Fatal(err)
}
defer stmt.Close()

_, err = stmt.Exec(1, "Toronto", 279435)
if err != nil {
    log.Fatal(err)
}

fmt.Println("Data inserted successfully!")

If all works correctly, you will see: Data inserted successfully!
Redis and Go

To connect to Redis, you need to use the go-redis driver.

Create a new directory:

mkdir connect-to-redis && cd connect-to-redis

Prepare the dependency file:

go mod init golang-connect-redis

And tidy the dependencies:

go mod tidy

Download the go-redis module:

go get github.com/go-redis/redis/v8

To connect to Redis, use the redis.Options structure to specify the address and port of the Redis server. Since Redis does not use authentication by default, you can leave the Password field empty and use the default database (database 0):

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/go-redis/redis/v8"
)

func main() {
    rdb := redis.NewClient(&redis.Options{
        Addr:     "91.206.179.128:6379",
        Password: "",
        DB:       0,
    })

    ctx := context.Background()

    _, err := rdb.Ping(ctx).Result()
    if err != nil {
        log.Fatalf("Couldn't connect to Redis: %v", err)
    }

    fmt.Println("Successfully connected to Redis!")
}

You should see the message "Successfully connected to Redis!"

MongoDB and Go

To work with MongoDB, we'll use the mongo driver.

Create a new directory to store the project structure:

mkdir connect-to-mongodb && cd connect-to-mongodb

Initialize the dependency file:

go mod init golang-connect-mongodb

Download the mongo library:

go get go.mongodb.org/mongo-driver/mongo

Connect to MongoDB using the options.Client().ApplyURI method. It takes a connection string such as mongodb://91.206.179.29:27017, where 91.206.179.29 is the MongoDB server address and 27017 is the port for connecting to MongoDB. The options.Client().ApplyURI string is used only for specifying connection data.
To check the connection status, you can use another function, client.Ping, which shows the success or failure of the connection:

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    clientOptions := options.Client().ApplyURI("mongodb://91.206.179.29:27017")
    client, err := mongo.Connect(context.TODO(), clientOptions)
    if err != nil {
        log.Fatalf("Couldn't connect to MongoDB server: %v", err)
    }
    fmt.Println("Successfully connected to MongoDB!")

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    err = client.Ping(ctx, nil)
    if err != nil {
        log.Fatalf("Could not ping MongoDB server: %v", err)
    }
    fmt.Println("Ping MongoDB server successfully!")
}

You should see the messages:

Successfully connected to MongoDB!
Ping MongoDB server successfully!

MongoDB uses collections to store data. You can create collections using the .Collection function. Below, we will create a database called first-database and a collection called first-collection. The collection will have a new document containing three keys: user-name, user-age, and user-email.

collection := client.Database("first-database").Collection("first-collection")

document := map[string]interface{}{
    "user-name":  "Alice",
    "user-age":   25,
    "user-email": "alice@corporate.com",
}

insertResult, err := collection.InsertOne(ctx, document)
if err != nil {
    log.Fatalf("Couldn't insert new document: %v", err)
}
fmt.Printf("Inserted new document with ID: %v\n", insertResult.InsertedID)

if err := client.Disconnect(ctx); err != nil {
    log.Fatalf("Could not disconnect from MongoDB: %v", err)
}
fmt.Println("Disconnected from MongoDB!")

If successful, you will see the "Inserted new document" message with the document ID.

ClickHouse and Go

To work with ClickHouse, use the clickhouse-go driver.
Create a new directory to store the project files and navigate to it:

mkdir clickhouse-connect && cd clickhouse-connect

Create a go.mod file to store the dependencies:

go mod init golang-connect-clickhouse

Download the ClickHouse driver using the command:

go get github.com/ClickHouse/clickhouse-go/v2

Create a new file named main.go, where you will specify the connection data for ClickHouse. Note the blank-identifier import, which registers the driver with database/sql:

package main

import (
    "database/sql"
    "log"

    _ "github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
    dsn := "tcp://localhost:9000?username=user1&password=PasswordForuser175465&database=new_db"

    db, err := sql.Open("clickhouse", dsn)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    if err := db.Ping(); err != nil {
        log.Fatal(err)
    }

    log.Println("Connected to ClickHouse!")
}

Database Connection in JavaScript

In JavaScript, all connections to external services are made using the Node.js platform. Make sure that you have Node.js and the npm package manager installed on your device.

MySQL and JavaScript

To work with MySQL, use the mysql2 driver.

Create a directory where we will store the project files:

mkdir js-mysql-connect && cd js-mysql-connect

Initialize the project:

npm init -y

Install the mysql2 library:

npm install mysql2

Use the following code to connect to MySQL:

const mysql = require('mysql2');

const connection_to_mysql = mysql.createConnection({
    host: 'localhost',
    user: 'root',
    password: 'PasswordForRoot74463',
    database: 'db1',
});

connection_to_mysql.connect((err) => {
    if (err) {
        console.error('Error connecting to MySQL:', err.message);
        return;
    }
    console.log('Successfully connected to MySQL Server!');

    connection_to_mysql.end((endErr) => {
        if (endErr) {
            console.error('Error closing the connection_to_mysql:', endErr.message);
        } else {
            console.log('Connection closed.');
        }
    });
});

PostgreSQL and JavaScript

Connecting to PostgreSQL is done using the pg library.
Create a directory where we will store the project files:

mkdir js-postgres-connect && cd js-postgres-connect

Initialize the project:

npm init -y

Install the pg library:

npm install pg

To connect to PostgreSQL, first import the pg library. Then create a constant with the database address, username, password, database name, and port, and pass the connection data to the new pg.Client class. We will create a table called cities and add two records to it using the queryDatabase function, which contains the SQL queries.

const pg = require('pg');

const config = {
    host: '91.206.179.29',
    user: 'gen_user',
    password: 'PasswordForGenUser56467$',
    database: 'default_db',
    port: 5432,
};

const client = new pg.Client(config);

client.connect(err => {
    if (err) throw err;
    else {
        queryDatabase();
    }
});

function queryDatabase() {
    const query = `
        DROP TABLE IF EXISTS cities;
        CREATE TABLE cities (id serial PRIMARY KEY, name VARCHAR(80), population INTEGER);
        INSERT INTO cities (name, population) VALUES ('Berlin', 3645000);
        INSERT INTO cities (name, population) VALUES ('Paris', 2161000);
    `;

    client
        .query(query)
        .then(() => {
            console.log('Table created successfully!');
            client.end(console.log('Closed client connection'));
        })
        .catch(err => console.log(err))
        .then(() => {
            console.log('Finished execution, exiting now');
            process.exit();
        });
}

Use this command to run the code:

node connect-to-postgres.js

Redis and JavaScript

To work with Redis, use the ioredis library.

Create a directory to store the project files:

mkdir js-redis-connect && cd js-redis-connect

Initialize the project:

npm init -y

Install the ioredis library:

npm install ioredis

To connect to Redis, import the ioredis library. Then create a constant named redis and specify the Redis server address.
Inserting data, i.e., creating key-value pairs, is done with an asynchronous function named setData, which takes two arguments, key and value, matching the Redis data format:

const Redis = require('ioredis');

const redis = new Redis({
    host: '91.206.179.29',
    port: 6379,
    password: 'UY+p8e?Kxmqqfa',
});

async function setData(key, value) {
    try {
        await redis.set(key, value);
        console.log('Data successfully set');
    } catch (error) {
        console.error('Error setting data:', error);
    }
}

async function getData(key) {
    try {
        const value = await redis.get(key);
        console.log('Data retrieved');
        return value;
    } catch (error) {
        console.error('Error getting data:', error);
    }
}

(async () => {
    await redis.select(1);
    await setData('user', 'alex');
    await getData('user');
    redis.disconnect();
})();

Run:

node connect-to-redis.js

MongoDB and JavaScript

To work with MongoDB, use the mongodb driver.

Create a directory for storing the project files:

mkdir js-mongodb-connect && cd js-mongodb-connect

Initialize the project:

npm init -y

Install the mongodb library:

npm install mongodb

To connect to MongoDB, import the mongodb library. Specify the database address in the constant uri and pass the address to the MongoClient class:

const { MongoClient } = require('mongodb');

const uri = "mongodb://91.206.179.29:27017";
const client = new MongoClient(uri, { useNewUrlParser: true, useUnifiedTopology: true });

async function connectToDatabase() {
    try {
        await client.connect();
        console.log("Successfully connected to MongoDB!");

        const database = client.db("myDatabase");
        const collection = database.collection("myCollection");

        const documents = await collection.find({}).toArray();
        console.log("Documents found:", documents);
    } catch (error) {
        console.error("Error connecting to MongoDB:", error);
    } finally {
        await client.close();
        console.log("Connection closed.");
    }
}

connectToDatabase();

ClickHouse and JavaScript

To work with ClickHouse, use the @clickhouse/client driver.
Create a directory where we will store the project files:

mkdir js-clickhouse-connect && cd js-clickhouse-connect

Initialize the project:

npm init -y

Install the @clickhouse/client library:

npm install @clickhouse/client

To connect to ClickHouse, use the code below, which sets the connection details and executes a simple SQL query that returns the first 10 records from the system table system.tables. Note that @clickhouse/client exports a createClient factory function:

const { createClient } = require('@clickhouse/client');

const client = createClient({
    host: 'http://localhost:8123',
    username: 'default',
    password: 'PasswordforDefaultUser45435',
    database: 'default',
});

async function connectAndQuery() {
    try {
        const rows = await client.query({
            query: 'SELECT * FROM system.tables LIMIT 10',
            format: 'JSON',
        }).then((result) => result.json());

        console.log('Successfully connected to ClickHouse Server!');
        console.log('Query results:', rows);
    } catch (error) {
        console.error('Error connecting to ClickHouse Server or running the query:', error);
    } finally {
        console.log('Done.');
    }
}

connectAndQuery();

Conclusion

In this article, we thoroughly explored how to connect to PostgreSQL, Redis, MongoDB, MySQL, and ClickHouse databases using Python, Go, and JavaScript. These languages can be used to create both web applications and microservices that utilize databases in their operation.
18 February 2025 · 23 min to read
Python

How to Parse HTML with Python

Parsing is the automatic search for various patterns (based on pre-defined structures) in text data sources to extract specific information. Although parsing is a broad term, it most commonly refers to the process of collecting and analyzing data from remote web resources.

In the Python programming language, you can create programs for parsing data from external websites using two key tools:

Standard HTTP request package
External HTML markup processing libraries

However, data processing capabilities are not limited to just HTML documents. Thanks to a wide range of external libraries in Python, you can organize parsing for documents of any complexity, whether they are arbitrary text, popular markup languages (e.g., XML), or even rare programming languages. If there is no suitable parsing library available, you can implement it manually using low-level methods that Python provides by default, such as simple string searching or regular expressions, although this requires additional skills.

This guide will cover how to organize parsers in Python. We will focus on extracting data from HTML pages based on specified tags and attributes. We run all the examples in this guide using the Python 3.10.12 interpreter on a Hostman cloud server with Ubuntu 22.04 and pip 22.0.2 as the package manager.

Structure of an HTML Document

Any document written in HTML consists of two types of tags:

Opening: defined within less-than (<) and greater-than (>) symbols, e.g., <div>.
Closing: defined within less-than (<) and greater-than (>) symbols with a forward slash (/), e.g., </div>.

Each tag can have various attributes, the values of which are written in quotes after the equals sign. Some commonly used attributes include:

href: a link to a resource, e.g., href="https://hostman.com".
class: the class of an object, e.g., class="surface panel panel_closed".
id: the identifier of an object, e.g., id="menu".
Each tag, with or without attributes, is an element (object) of the so-called DOM (Document Object Model) tree, which is built by practically any HTML interpreter (parser). This builds a hierarchy of elements, where nested tags are child elements of their parent tags.

For example, in a browser, we access elements and their attributes through JavaScript scripts. In Python, we use separate libraries for this purpose. The difference is that after parsing the HTML document, the browser not only constructs the DOM tree but also displays it on the monitor.

<!DOCTYPE html>
<html>
<head>
    <title>This is the page title</title>
</head>
<body>
    <h1>This is a heading</h1>
    <p>This is a simple text.</p>
</body>
</html>

The markup of this page is built with tags in a hierarchical structure without specifying any attributes:

html
  head
    title
  body
    h1
    p

Such a document structure is more than enough to extract information. We can parse the data by reading the data between opening and closing tags. However, real website tags have additional attributes that specify both the specific function of the element and its special styling (described in separate CSS files):

<!DOCTYPE html>
<html>
<body>
    <h1 class="h1_bright">This is a heading</h1>
    <p>This is simple text.</p>

    <div class="block" href="https://hostman.com/products/cloud-server">
        <div class="block__title">Cloud Services</div>
        <div class="block__information">Cloud Servers</div>
    </div>

    <div class="block" href="https://hostman.com/products/vps-server-hosting">
        <div class="block__title">VPS Hosting</div>
        <div class="block__information">Cloud Infrastructure</div>
    </div>

    <div class="block" href="https://hostman.com/services/app-platform">
        <div class="block__title">App Platform</div>
        <div class="block__information">Apps in the Cloud</div>
    </div>
</body>
</html>

Thus, in addition to explicitly specified tags, the required information can be refined with specific attributes, extracting only the necessary elements from the DOM tree.
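The tag hierarchy described above can be observed directly with Python's standard library html.parser module, which fires a callback for every opening and closing tag. This stdlib parser is shown purely for illustration; the higher-level libraries covered below are more convenient for real work:

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Collects tag names indented by their nesting depth."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then descend one level
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

parser = TreePrinter()
parser.feed("<html><head><title>t</title></head>"
            "<body><h1>h</h1><p>p</p></body></html>")
print("\n".join(parser.lines))
# Prints the same indented hierarchy as the outline above:
# html / head / title / body / h1 / p
```

Note this simple depth counter assumes every opened tag is explicitly closed; void elements such as <br> would need extra handling.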
HTML Data Parser Structure

Web pages can be of two types:

Static: during the loading and viewing of the site, the HTML markup remains unchanged. Parsing does not require emulating the browser's behavior.
Dynamic: during the loading and viewing of the site (single-page application, SPA), the HTML markup is modified using JavaScript. Parsing requires emulating the browser's behavior.

Parsing static websites is relatively simple: after making a remote request, the necessary data is extracted from the received HTML document.

Parsing dynamic websites requires a more complex approach. After making a remote request, both the HTML document itself and the JavaScript scripts controlling it are downloaded to the local machine. These scripts, in turn, usually perform several remote requests automatically, loading additional content and modifying the HTML document while the page is viewed. Because of this, parsing dynamic websites requires emulating the browser's behavior and user actions on the local machine. Without this, the necessary data simply won't load.

Most modern websites load additional content using JavaScript scripts in one way or another. The variety of technical implementations of modern websites is so large that they can't be classified as entirely static or entirely dynamic. Typically, general information is loaded initially, while specific information is loaded later.

Most HTML parsers are designed for static pages. Systems that emulate browser behavior to generate dynamic content are much less common.

In Python, libraries (packages) intended for analyzing HTML markup can be divided into two groups:

Low-level processors: compact but syntactically complex packages with a complicated implementation that parse HTML (or XML) syntax and build a hierarchical tree of elements.
High-level libraries and frameworks: large but syntactically concise packages with a wide range of features to extract formalized data from raw HTML documents.
This group includes not only compact HTML parsers but also full-fledged systems for data scraping. Often, these packages use low-level parsers (processors) from the first group as their core for parsing.

Several low-level libraries are available for Python:

lxml: a low-level XML syntax processor that is also used for HTML parsing. It is based on the popular libxml2 library written in C.
html5lib: a Python library for HTML syntax parsing, written according to the HTML specification by WHATWG (The Web Hypertext Application Technology Working Group), which is followed by all modern browsers.

However, using high-level libraries is faster and easier—they have simpler syntax and a wider range of functions:

BeautifulSoup: a simple yet flexible library for Python that allows parsing HTML and XML documents by creating a full DOM tree of elements and extracting the necessary data.
Scrapy: a full-fledged framework for parsing data from HTML pages, consisting of autonomous "spiders" (web crawlers) with pre-defined instructions.
Selectolax: a fast HTML page parser that uses CSS selectors to extract information from tags.
Parsel: a Python library with a specific selector syntax that allows you to extract data from HTML, JSON, and XML documents.
requests-html: a Python library that closely mimics the CSS selectors browsers implement in JavaScript.

This guide will review several of these high-level libraries.

Installing the pip Package Manager

We can install all parsing libraries (as well as many other packages) in Python through the standard package manager, pip, which needs to be installed separately.

First, update the list of available repositories:

sudo apt update

Then, install pip using the APT package manager:

sudo apt install python3-pip -y

The -y flag will automatically confirm all terminal prompts during the installation.
To verify that pip was installed correctly, check its version:

pip3 --version

The terminal will display the pip version and the installation path:

pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)

As shown, this guide uses pip version 22.0.2.

Installing the HTTP Requests Package

Many systems ship Python with the Requests package already installed (it is not part of the standard library). It allows making requests to remote servers, and we will use it in the examples in this guide. If it is missing, you can install requests manually via pip:

pip install requests

If the system already has it, you will see the following message in the terminal:

Requirement already satisfied: requests in /usr/lib/python3/dist-packages (2.25.1)

Otherwise, the command makes requests available for import in Python scripts.

Using BeautifulSoup

To install BeautifulSoup version 4, use pip:

pip install beautifulsoup4

After this, the library is available for import in Python scripts. However, it also needs one of the previously mentioned low-level HTML processors to work properly.

First, install lxml:

pip install lxml

Then install html5lib:

pip install html5lib

Later, you can specify either of these processors as the underlying parser for BeautifulSoup in your Python code.
Create a new file in your home directory:

nano bs.py

Add the following code:

import requests
from bs4 import BeautifulSoup

# Request the website 'https://hostman.com'
response = requests.get('https://hostman.com')

# Parse the HTML content of the page using the 'html5lib' parser
page = BeautifulSoup(response.text, 'html5lib')

# Extract the title of the page
pageTitle = page.find('title')
print(pageTitle)
print(pageTitle.string)
print("")

# Extract all <a> links on the page
pageLinks = page.find_all('a')

# Print the text of the first 3 links (if they exist)
for link in pageLinks[:3]:
    print(link.string)
print("")

# Find all div elements with a class starting with 'socials--'
social_links_containers = page.find_all('div', class_=lambda c: c and c.startswith('socials--'))

# Collect the links from these divs
for container in social_links_containers:
    links = container.find_all('a', href=True)
    for link in links:
        href = link['href']
        # Ignore links related to Cloudflare's email protection
        if href.startswith('/cdn-cgi/l/email-protection'):
            continue
        print(href)

Now run the script:

python bs.py

This produces the following console output:

<title>Hostman - Cloud Service Provider with a Global Cloud Infrastructure</title>
Hostman - Cloud Service Provider with a Global Cloud Infrastructure

Partners
Tutorials
API

https://wa.me/35795959804
https://twitter.com/hostman_com
https://www.facebook.com/profile.php?id=61556075738626
https://github.com/hostman-cloud
https://www.linkedin.com/company/hostman-inc/about/
https://www.reddit.com/r/Hostman_com/

Of course, you can specify lxml instead of html5lib:

page = BeautifulSoup(response.text, 'lxml')

However, html5lib is the better choice as the processor here. Unlike lxml, which is primarily designed for XML markup, html5lib fully supports modern HTML5 standards.
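The script above uses BeautifulSoup's own search API (find() and find_all()), but the library also accepts CSS selectors through select(), which can be convenient if you already know them from JavaScript. A small self-contained sketch (the HTML string here is made up for illustration, and the built-in 'html.parser' is used so no extra processor is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the 'socials--' divs from the example above
html_doc = """
<html><body>
  <div class="socials--footer">
    <a href="https://twitter.com/example">Twitter</a>
    <a href="https://github.com/example">GitHub</a>
  </div>
</body></html>
"""

page = BeautifulSoup(html_doc, "html.parser")

# CSS selector: every <a> with an href inside a div whose class starts with 'socials--'
links = [a["href"] for a in page.select('div[class^="socials--"] a[href]')]
print(links)
```

The selector expresses in one line what the lambda-based find_all() call in the script does imperatively; which style to use is largely a matter of taste.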
Despite its concise syntax, the BeautifulSoup library does not support browser emulation, meaning it cannot load content dynamically.

Using Scrapy

The Scrapy framework takes a more object-oriented approach. In Scrapy, website parsing is built on three core entities:

- Spiders: classes that hold the parsing details for specific websites, including URLs, element selectors (CSS or XPath), and page-browsing mechanisms.
- Items: variables for storing extracted data; more elaborate forms of Python dictionaries with a special internal structure.
- Pipelines: intermediate handlers for extracted data that can modify items and interact with external software (such as databases).

Install Scrapy through the pip package manager:

pip install scrapy

After that, initialize a parser project, which creates a separate directory with its own folder structure and configuration files:

scrapy startproject parser

Now navigate to the newly created directory:

cd parser

Check the contents of the current directory:

ls

It contains a general configuration file and a directory with the project's source files:

parser scrapy.cfg

Move to the source files directory:

cd parser

If you check its contents:

ls

you will see special Python scripts, each with its own purpose, and a separate directory for spiders:

__init__.py items.py middlewares.py pipelines.py settings.py spiders

Open the settings file:

nano settings.py

By default, most parameters are commented out with the hash symbol (#). For the parser to work correctly, uncomment the following parameters without changing the default values specified in the file:

USER_AGENT
ROBOTSTXT_OBEY
CONCURRENT_REQUESTS
DOWNLOAD_DELAY
COOKIES_ENABLED

Each specific project will require more precise configuration of the framework. You can find all available parameters in the official documentation.
After that, generate a new spider:

scrapy genspider hostmanspider hostman.com

The console should display a message about the creation of the new spider:

Created spider 'hostmanspider' using template 'basic' in module: parser.spiders.hostmanspider

Now, if you check the contents of the spiders directory:

ls spiders

you will see the source files for the new spider:

__init__.py  __pycache__  hostmanspider.py

Open the script file:

nano spiders/hostmanspider.py

and fill it with the following code:

import scrapy  # Package from the Scrapy framework


class HostmanSpider(scrapy.Spider):  # The spider class inherits from scrapy.Spider
    name = 'hostmanspider'  # Name of the spider

    def start_requests(self):
        urls = ["https://hostman.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        open("output", "w").close()  # Clear the contents of the 'output' file
        someFile = open("output", "a")  # Open the file for appending
        dataTitle = response.css("title::text").get()  # Extract the page title using a CSS selector
        dataA = response.css("a").getall()  # Extract all links using a CSS selector
        someFile.write(dataTitle + "\n\n")
        for i in range(3):  # Write the first 3 links to the file
            someFile.write(dataA[i] + "\n")
        someFile.close()

You can now run the created spider with the following command:

scrapy crawl hostmanspider

Running the spider will create an output file in the current directory.
To view the contents of this file, you can use:

cat output

The content of this file will look something like this:

Hostman - Cloud Service Provider with a Global Cloud Infrastructure

<a href="/partners/" itemprop="url" class="body4 medium nd-link-primary"><span itemprop="name">Partners</span></a>
<a href="/tutorials/" itemprop="url" class="body4 medium nd-link-primary"><span itemprop="name">Tutorials</span></a>
<a href="/api-docs/" itemprop="url" class="body4 medium nd-link-primary"><span itemprop="name">API</span></a>

More detailed information on extracting data with selectors (both CSS and XPath) can be found in the official Scrapy documentation.

Conclusion

Data parsing from remote sources in Python relies on two main components:

- A package for making remote requests
- Libraries for parsing data

These libraries range from simple ones, suitable only for parsing static websites, to more complex ones that emulate browser behavior and can therefore parse dynamic websites.

In Python, the most popular libraries for parsing static data are:

- BeautifulSoup
- Scrapy

Much like JavaScript DOM functions (e.g., getElementsByClassName() or CSS-selector queries), these tools let us extract data (attributes and text) from the elements of an HTML document's DOM tree.
11 February 2025 · 13 min to read