- Use SQLite and Python locally to recreate a realistic SQL practice environment without needing a full data warehouse or Spark cluster.
- Master core SQL skills first: filtering with WHERE, joining multiple tables, and aggregating data with GROUP BY and HAVING.
- Normalize schemas into multiple tables with primary and foreign keys, then use JOINs to reconstruct relationships in your analyses.
- Combine local practice with interactive SQL platforms to rehearse interview-style questions and regain confidence with modern data tooling.
If you’re trying to get back into SQL and Python after a few years away, it’s completely normal to feel lost – especially if your last role used proprietary tools and comfy Databricks notebooks that you no longer have. Modern job postings that demand Python, SQL and even PySpark can look intimidating when every guide starts with something like “load your claims dataset into your data warehouse” and you’re thinking: “That’s exactly what I don’t have.”
The good news is that you can recreate most of that learning experience on your own laptop using free tools, small sample datasets, and a structured set of practice problems. In this guide we’ll walk through, in plain English, how to build a realistic local environment, how SQL works (from basic queries to JOINs and aggregation), and how to wrap those SQL queries in Python so you can practice exactly the kind of tasks you’ll face in modern data jobs.
Building a simple local practice environment with SQLite and Python
You don’t need a full-blown data warehouse or Spark cluster to practice SQL plus Python. For learning and interview prep, a lightweight embedded database like SQLite is more than enough. SQLite stores all its data in a single file on disk, which makes it perfect for toy projects, prototypes and educational exercises.
Conceptually, a SQLite database looks a lot like a spreadsheet with multiple sheets: each sheet is a table, each row is a record, and each column is a field. In relational database jargon, tables are sometimes called “relations”, rows are “tuples”, and columns are “attributes”, but for hands‑on work you can happily stick with the everyday terms table, row and column.
Python ships with a built-in SQLite driver called sqlite3, which means you don’t need to install a separate database server. Your Python script will open a connection to a .sqlite file (creating it if it doesn’t exist), obtain a cursor object (very similar to a file handle), and then send SQL commands through that cursor using execute(). See our SQLite SELECT and WHERE guide for practical examples of reading and filtering data.
While this article focuses on driving SQLite from Python, there’s also a handy GUI tool called “Database Browser for SQLite” (sometimes distributed as DB Browser for SQLite). With it you can visually inspect tables, insert or edit a few rows by hand, and run simple SQL statements. It’s like a text editor for database files: quick manual tweaks are easier in the GUI, but anything repetitive or complex is better scripted in Python.
Relational databases are more rigid than Python lists or dicts: they insist on a defined schema. When you create a table you must declare the column names and the data types you expect (text, integer, date/time, etc.). SQLite will then store and index the data in a way that keeps lookups efficient, even as your dataset grows beyond what fits comfortably in memory. For practical learning paths and hands-on examples, consult análisis de datos con SQL.
Creating tables and inserting data with SQL and Python
To start practicing, you first need a table – think of it as designing the shape of your data. Suppose you want a tiny music library table. Using Python’s sqlite3 module you can connect to a database file, drop any old version of the table if it exists, and then create a new table with clearly typed columns.
Here’s what that flow looks like conceptually in Python: you call sqlite3.connect('music.sqlite') to open or create the database file, then call conn.cursor() to get a cursor. Through that cursor you can run SQL commands like DROP TABLE IF EXISTS Songs to clear any previous schema, followed by CREATE TABLE Songs (title TEXT, plays INTEGER) to define a new table with two columns.
Once the table exists, you switch from DDL (Data Definition Language) to DML (Data Manipulation Language) with INSERT statements. In Python you should always use parameterized queries: write INSERT INTO Songs (title, plays) VALUES (?, ?) and pass a tuple like ('Thunderstruck', 20) as the second argument to execute(). The question marks are placeholders that Python will safely substitute, helping you avoid SQL injection issues and quoting bugs.
After performing inserts or updates, you must call conn.commit() to flush your changes to disk. Until you commit, the operations live only in a transaction buffer. This is different from simple file writes, and it’s one of the key habits to build early: query, modify, then commit.
To read your data back, you use a SELECT statement and iterate over the cursor. For example, SELECT title, plays FROM Songs will stream each row as a Python tuple, such as ('Thunderstruck', 20). The cursor does not load all results at once; instead it fetches rows lazily, which is useful when you eventually deal with larger datasets.
Core SQL query elements and filtering with WHERE
Every SQL query is built on a small set of clauses that appear in a standard order: SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY. At a minimum you specify what columns you want (SELECT) and from which table (FROM). The optional clauses then refine, aggregate, filter aggregated results, and sort the output.
The WHERE clause filters rows before any grouping or aggregation takes place. For numeric columns you can use comparison operators like =, != (or <>), >, <, >=, <=. Text columns support these plus pattern matching via LIKE and membership checks via IN. Date/time values support the same relational comparisons, and you often see ranges expressed with BETWEEN.
Null handling in SQL is quirky enough that it deserves explicit attention. Regular comparisons like = and != do not behave as you might expect with NULL, so SQL provides IS NULL and IS NOT NULL to check for missing values. Boolean columns usually work with = and !=, but you still need IS NULL when the boolean itself can be missing.
When you combine multiple conditions, remember that AND and OR follow precedence rules. If you write age < 5 OR age > 10 AND breed = 'Ragdoll', SQL will evaluate the AND first. To express “Ragdoll cats younger than 5 or older than 10”, you should use parentheses: (age < 5 OR age > 10) AND breed = 'Ragdoll'. Getting comfortable with these logical combinations is crucial for real‑world analytics work.
Pattern matching with LIKE lets you search for strings that start, end or contain certain fragments. The percent sign % is a wildcard for any sequence of characters, so breed LIKE 'R%' finds breeds that begin with “R”, fav_toy LIKE 'ball%' finds toys whose names start with “ball”, and coloration LIKE '%m' finds color patterns that end with “m”. Paired with AND/OR, this becomes a powerful text-filtering toolkit.
Practicing single-table queries with a toy dataset
A helpful way to build muscle memory is to fix a small schema in your head and solve many queries against it. Imagine a cat table with columns like id, name, breed, coloration, age, sex, and fav_toy. This gives you enough variety – text, numbers, simple categoricals – to practice most basic query patterns.
For boolean-style checks, you often filter on one column and then layer additional conditions. To list “boring” male cats that have no favorite toy recorded, you would select the name where sex = 'M' and fav_toy IS NULL. This illustrates how null checks pair with straightforward comparisons to isolate a specific subset of rows.
To target specific breeds or exclude them, you combine equality with logical negation. Selecting only Ragdoll cats with certain ages uses breed = 'Ragdoll'; excluding Persians and Siamese could look like breed NOT LIKE 'Persian' AND breed NOT LIKE 'Siamese'. While some databases support NOT IN ('Persian', 'Siamese'), practicing the explicit pattern helps cement your understanding of NOT and LIKE.
Exercises like “female cats that love teaser toys and are not Persian or Siamese” force you to blend text filters, equality and logical operators. You’d select id, name, breed, coloration and constrain the rows using sex = 'F', fav_toy = 'teaser', and a compound condition that excludes the unwanted breeds. Paying attention to parentheses ensures that all subconditions are applied in the intended combination.
Once you’re comfortable with these toy examples in raw SQL, re‑implement them via Python using parameterized queries. Write short scripts that ask for breed, minimum age or toy type from input(), plug them into WHERE clauses, and print out the results. This is exactly the bridge between query-writing and real application code that many junior data roles expect.
Understanding and practicing SQL JOINs
As soon as you move beyond toy problems, you’ll join multiple tables constantly. JOINs are how you connect related datasets: customers to orders, artists to artworks, games to companies, and so on. In SQL, you describe which columns should match between the tables, and the database engine merges the rows into a combined result set.
There are four primary join types you’ll meet in interviews and real projects: INNER JOIN (often written just JOIN), LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN. An inner join returns only rows where both tables have matching keys; a left join keeps all rows from the left table, filling in NULLs when the right table has no match; a right join does the symmetric thing; and a full outer join returns every row from both sides, matching where possible and using NULL where not.
Think of LEFT JOIN and RIGHT JOIN as “trust this side more” operations. With a left join, the left table is the primary source of truth: every row from it appears at least once in the output, even when the right table contributes nothing. With a full join, neither side is privileged – you simply union together all keys from both tables and align them where they overlap.
To keep multi-table queries readable, always alias your tables. Instead of writing SELECT artist.name repeatedly, write FROM artist AS a and then reference columns as a.name. Similarly, piece_of_art can become poa, and museum can be m. When your query grows to three or more joins, good aliases are the difference between clarity and chaos.
A classic training setup uses a trio of tables: artist, museum, and piece_of_art. The artist table might hold id, name, birth_year, death_year and a primary field like watercolor or sculpture. The museum table stores id, name and country. The piece_of_art table holds id, name, artist_id and museum_id. Those last two columns are foreign keys that link each artwork to its creator and location.
With that schema, you can practice inner joins, left joins and conditional filters. For example, to list artists born after 1800 who lived more than 50 years, alongside the names of their works, you’d join artist and piece_of_art on artist.id = piece_of_art.artist_id and then filter with death_year - birth_year > 50 and birth_year > 1800. Alias the selected columns as artist_name and piece_name for clarity.
To see all artworks together with the museum names and countries – including “lost” pieces with no museum – you’d use a LEFT JOIN from piece_of_art to museum on museum_id. That way, artworks without an associated museum still appear in the result, with NULL in the museum columns. Filtering rows where artist_id IS NULL lets you detect works by unknown artists while still joining to the museums holding them.
More advanced exercises have you join three tables simultaneously. To list each artwork with both its artist and the museum name, you’d join museum to piece_of_art on museum.id = piece_of_art.museum_id, then join artist on artist.id = piece_of_art.artist_id. Using plain JOIN (inner join) intentionally drops artworks that either lack an artist or a museum, giving you a feel for how join type impacts the row count.
Aggregation, GROUP BY, and HAVING in practice
Once you can retrieve and join data, the next big skill is summarizing it. Aggregation functions like SUM(), AVG(), COUNT(), MAX(), and MIN() compute metrics over sets of rows. GROUP BY partitions your dataset into groups and applies those functions within each group – for instance, one group per year, per company, or per artist. If you prefer structured courses to practice these concepts, see a comprehensive SQL course.
Imagine a simple sales_table with columns year, month, and sales. A plain SELECT SUM(sales) AS total_sales FROM sales_table gives you the grand total across all rows. Adding GROUP BY year changes the question: now you’re asking for total sales per year rather than a single overall number.
The key rule is that every non-aggregated column in your SELECT must appear in GROUP BY. If you select year and SUM(sales), you group by year. If you select year and month together with aggregates, then you group by both year and month. Conceptually, distinct combinations of the grouped columns define the groups.
WHERE and HAVING are both filters, but they act at different stages. WHERE filters raw rows before any grouping or aggregation occurs. HAVING filters the grouped results using aggregate expressions. For example, you might WHERE production_year BETWEEN 2000 AND 2009 and then HAVING SUM(revenue) > 4000000 to keep only companies whose “good games” generated more than four million in revenue.
A more realistic practice schema is a games table with columns like id, title, company, type, production_year, system, production_cost, revenue, and rating. With this single table you can exercise averages, counts, sums, grouping and ranking – the bread and butter of analytics SQL.
For instance, to compute the average production cost of games released from 2010 to 2015 with rating greater than 7, you’d select AVG(production_cost) and constrain rows with WHERE production_year BETWEEN 2010 AND 2015 AND rating > 7. That’s a classic interview-style question, and you can easily embed it into Python and print the resulting single number.
You can also produce year-level statistics directly from the same games table. Group by production_year, then compute COUNT(*) AS count, AVG(production_cost) AS avg_cost, and AVG(revenue) AS avg_revenue. This kind of query gives you a compact time series view that’s extremely common in BI dashboards and reporting tools.
To rank companies by gross profit across all years, you can aggregate on company. A handy pattern is SELECT company, SUM(revenue - production_cost) AS gross_profit_sum FROM games GROUP BY 1 ORDER BY 2 DESC. Here GROUP BY 1 and ORDER BY 2 use column positions in the SELECT list, which can keep things concise but must be used carefully so you don’t break queries by reordering columns later.
More complex prompts tie together filters, grouping and post-aggregation filters. Suppose you define “good games” as those produced between 2000 and 2009, with rating above 6 and revenue greater than production cost. For each company, you want the number of such games plus their total revenue, but only for companies whose revenue from good games exceeds 4,000,000. You’d filter rows with WHERE on production_year, rating, and profitability, group by company, compute COUNT(company) and SUM(revenue), then apply HAVING SUM(revenue) > 4000000. This one query captures most of the real-world mental steps you’ll face in analytics tasks.
Modeling data with multiple tables and keys
Single-table designs get you pretty far, but relational databases shine when you normalize data across multiple tables. Normalization is the process of eliminating redundant storage and representing relationships via keys. That keeps your database smaller, faster, and less error-prone.
A simple but instructive example comes from crawling Twitter-like social graphs. Say you want to track user accounts and the “follows” relationships between them. One naive approach would be a single table where each row duplicates both follower and followee names as text. That quickly leads to heavy repetition and inconsistent spelling.
Instead, you split things into a People table and a Follows table. People might have an integer id as primary key, a unique name (the screen name or handle), and a retrieved flag signalling whether you have already crawled that account’s friend list. Follows holds pairs of integers from_id and to_id, representing directed connections from one user to another.
Three key concepts structure this model: logical keys, primary keys, and foreign keys. A logical key is what the outside world uses to refer to a record – here, the Twitter handle in name. A primary key is usually a database-generated integer (id) that uniquely identifies each row and is cheap to index and compare. A foreign key is an integer that points to a primary key in another table – from_id and to_id in the Follows table are foreign keys referencing People.id.
To enforce data quality, you declare constraints in your table definitions. For example, name TEXT UNIQUE in People ensures you can’t accidentally insert two rows with the same handle. A UNIQUE(from_id, to_id) constraint in Follows prevents you from storing the same follow edge more than once. These constraints double as safety nets when you start writing upsert logic in Python.
In Python’s sqlite3 module, a common pattern is to use INSERT OR IGNORE to respect those constraints gracefully. If you attempt to insert a name that already exists, SQLite will quietly skip the operation rather than error out. You can then check cursor.rowcount to see whether a row was truly added, and rely on cursor.lastrowid to discover the assigned id for newly inserted users.
When your code receives a new screen name, it should first try to look up the corresponding id. If a SELECT id FROM People WHERE name = ? returns a row, you reuse that integer. If not, you insert the name with retrieved = 0, commit, and then read lastrowid. That “find or insert” pattern is at the heart of many data ingestion scripts.
Once both follower and followee IDs are known, recording the relationship in Follows is just another INSERT OR IGNORE. Your UNIQUE(from_id, to_id) constraint addresses duplicates, and you can concentrate on the higher-level logic of which profiles to crawl next, rather than micromanaging row deduplication.
Using JOIN to reconstruct relationships from normalized tables
Normalized schemas trade redundancy for indirection: you store integers instead of repeated strings, but now you must join tables to reconstruct the full picture. This is exactly what SQL JOIN was designed for, and once you’re used to it, JOIN-heavy queries feel entirely natural.
In the social-graph example, if you want to see who user with id = 2 is following, you would join Follows to People on the target side. Conceptually, you run SELECT * FROM Follows JOIN People ON Follows.to_id = People.id WHERE Follows.from_id = 2. This produces combined rows that contain both the numerical edge and the human-readable name for every followee.
Each row in that result is a “meta-row” that merges columns from both tables. The first two columns might be (from_id, to_id) from Follows, while the subsequent columns belong to People – like (id, name, retrieved). Because the JOIN condition enforces Follows.to_id = People.id, you can see that relationship explicitly: the second column and the third column of each row match.
This same pattern extends naturally to more tables. You already saw it with artist, piece_of_art, and museum, and the Twitter crawler illustrates it with People and Follows. In more complex analytical pipelines, you might join fact tables (events, orders) to multiple dimension tables (users, products, campaigns) to answer multi-faceted questions.
When debugging your code or learning how the schema fits together, a workflow of “run Python, then inspect with DB Browser for SQLite” is extremely effective. Execute your script to populate the database, close any GUI instance that’s keeping the file locked, then open the .sqlite file in the browser. From there you can inspect the contents of each table and run ad‑hoc SELECT queries to verify your assumptions.
One caveat: SQLite enforces file locks, so if the DB Browser has the database open in edit mode, your Python script may fail to connect or commit. The fix is to close the database in the GUI (or exit the browser entirely) before running your Python code again. Getting into the habit of closing tools that lock your DB file will save you from mysterious “database is locked” errors.
Bringing these techniques together – schema design, constraints, parameterized queries in Python, JOINs, GROUP BY and HAVING – gives you a powerful local lab for practicing exactly the kind of SQL and Python work you’ll do on the job. With nothing more than SQLite and a few well-structured sample tables, you can rehearse interview-style questions, prototype analytic logic, and regain your confidence with modern data tooling.
Where platforms like DataLemur and interactive courses fit in
On top of your local practice, interactive platforms can give you a more guided experience with instant feedback. Tools born out of real-world industry experience – for example, platforms created by former Facebook and Google data engineers who spent their days writing SQL and Python and running A/B tests – often center their content around genuine interview questions and analytics scenarios.
Books that cover statistics, machine learning and business intuition for data interviews are great for theory, but they don’t always provide the hands-on SQL playground many learners crave. That gap is exactly what some modern tools aim to fill: they repackage hundreds of interview-style prompts into an in-browser SQL and analytics environment so you can run, tweak and re‑run your queries without worrying about local setup. You can also try applied examples like customer churn risk evaluation to combine SQL with basic machine learning workflows.
You’ll also find interactive SQL courses that mirror the topics we’ve walked through here: single-table queries with SELECT and WHERE, joins across two or three tables, aggregation and grouping, subqueries, and more. Many of these courses rely on realistic datasets – think games, museums, or transactional sales – so that the questions feel like genuine business problems rather than contrived puzzles.
If you’re feeling overwhelmed by documentation for tools like PySpark, DuckDB or dbt, it’s perfectly reasonable to defer them until your SQL fundamentals feel solid. Focusing first on SQLite plus Python lets you internalize the core query patterns without fighting cluster configuration or cloud permissions. Once the basics are second nature, learning PySpark becomes more about distributed execution than about new query concepts.
Ultimately, the combination of a simple local setup, structured practice problems, and occasional use of interactive platforms gives you the best of all worlds: full control over your environment, strong conceptual grounding, and exposure to the style of questions top employers love. With steady practice, the once-daunting mix of SQL, Python and data engineering tools becomes a familiar, even enjoyable, toolkit you can wield confidently in new roles.
Putting it all together, your path forward is clear: spin up a SQLite database with Python, design a few realistic tables, drill basic and intermediate SQL patterns (filters, joins, aggregation, grouping, HAVING), wrap those queries in Python scripts, and optionally supplement your learning with interactive SQL platforms built by practitioners who’ve been exactly where you are now; by doing so, you’ll rebuild your technical instincts, reduce the anxiety around modern data stacks, and be ready to handle the SQL and Python demands of today’s data roles.