# NoSQL Databases

NoSQL databases are non-tabular and store data differently than relational tables. They come in a variety of types based on their data model; the main types are document, key-value, wide-column, and graph. They offer flexible schemas and are designed to scale out to large data volumes and high user loads.

![](https://1172597814-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MJ6Mj8gFbz9Ji6QL6Zi%2F-MM_w0pYDGh_tiQhqEc4%2F-MM_y0D8SsQsKPl4CYek%2Ftypes-of-nosql-datastores.png?alt=media\&token=a591f988-d063-47b5-b00f-582c26fccf4d)

### Types of NoSQL Databases

1. **Document databases** (e.g. CouchDB, MongoDB, Google Firestore). Inserted data is stored in the form of free-form JSON structures or "documents," where the data could be anything from integers to strings to freeform text. There is no inherent need to specify what fields, if any, a document will contain.
2. **Key-value stores** (e.g. Redis, Riak, DynamoDB). Free-form values—from simple integers or strings to complex JSON documents—are accessed in the database by way of keys. Extremely fast for simple lookups.
3. **Wide column stores** (e.g. HBase, Cassandra, Google Bigtable). Data is organized into rows identified by a row key, but unlike a conventional SQL table, each row can hold its own set of columns, usually grouped into column families. Rows can be sparse and very wide, and columns (and therefore many different types of data) can be grouped as needed for queries or data views.
4. **Graph databases** (e.g. Neo4j, Amazon Neptune). Data is represented as a network or graph of entities and their relationships, with each node in the graph a free-form chunk of data.
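
To make the four data models easier to compare, here is a rough sketch of how the same small piece of information might be shaped in each one, using plain Python structures. The keys, labels, and field names are invented for illustration and do not correspond to any particular product's storage format.

```python
# Document: one self-contained, nested record.
document = {"_id": "user:1", "name": "John", "orders": [{"product": "Laptop"}]}

# Key-value: opaque values looked up by key.
key_value = {"user:1:name": "John", "user:1:last_order": "Laptop"}

# Wide-column: row key -> column family -> columns.
# Different rows are free to carry different columns.
wide_column = {
    "user:1": {"profile": {"name": "John"}, "orders": {"2023-10-20": "Laptop"}},
    "user:2": {"profile": {"name": "Jane", "city": "Los Angeles"}},
}

# Graph: nodes plus typed, directed relationships between them.
nodes = {
    "user:1": {"label": "Person", "name": "John"},
    "product:1": {"label": "Product", "name": "Laptop"},
}
edges = [("user:1", "PURCHASED", "product:1")]
```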

### Core Characteristics of NoSQL Databases

**Flexible Schema**: NoSQL databases allow for dynamic schemas where documents in the same collection can have different structures. New fields can be added without affecting existing data or requiring schema migrations.
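
As a minimal sketch of what a flexible schema means in practice, the snippet below models a collection as a plain Python list of dictionaries. The `users` collection, its field names, and the `add_field` helper are made up for illustration; no real database driver is involved.

```python
# Hypothetical in-memory "collection": each document is just a dict,
# and documents are free to carry different fields.
users = [
    {"_id": 1, "name": "John Doe", "email": "john@example.com"},
    {"_id": 2, "name": "Jane Smith", "phone": "+1-555-0100", "tags": ["vip"]},
]


def add_field(collection, doc_id, field, value):
    """Add a new field to one document without touching the others."""
    for doc in collection:
        if doc["_id"] == doc_id:
            doc[field] = value
            return doc
    raise KeyError(doc_id)


add_field(users, 1, "loyalty_points", 120)

# Readers must tolerate missing fields, for example by supplying defaults.
emails = [doc.get("email", "<none>") for doc in users]
```

The flip side is visible in the last line: because nothing guarantees that a field exists, every reader has to decide what to do when it is absent.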

**Horizontal Scalability**: NoSQL databases are designed to scale out by distributing data across multiple servers (sharding), rather than scaling up by adding more resources to a single server.
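
One simple way to picture sharding is hash-based routing: a stable hash of the key decides which server owns the record. The sketch below assumes a fixed number of shards and uses hypothetical helper names (`shard_for`, `distribute`); it is an illustration of the idea rather than how any specific database routes data.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys, num_shards):
    """Group keys by the shard that would store them."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


shards = distribute([f"user_{i}" for i in range(1000)], num_shards=4)
```

With plain modulo routing, changing `num_shards` moves most keys to a different shard, which is one reason real systems tend to use schemes such as consistent hashing or range partitioning instead.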

**Eventual Consistency**: Many NoSQL databases favor availability and partition tolerance over immediate consistency (CAP theorem), meaning data may be temporarily inconsistent across nodes but will eventually converge.
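
The toy model below sketches eventual consistency with two replicas that accept writes independently and reconcile later. The `Replica` class, the `sync` function, and the hand-supplied version numbers are assumptions made for this example; real systems use mechanisms such as timestamps or vector clocks and considerably more careful conflict handling.

```python
class Replica:
    """Toy replica storing key -> (version, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, version):
        """Accept a write only if it is newer than what is already stored."""
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange entries in both directions; the higher version wins."""
    for src, dst in ((a, b), (b, a)):
        for key, (version, value) in list(src.data.items()):
            dst.write(key, value, version)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("profile:123", "John D.", version=1)

stale = r2.read("profile:123")      # r2 has not seen the write yet
sync(r1, r2)
converged = r2.read("profile:123")  # after syncing, both replicas agree
```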

**Denormalization**: Data is often duplicated across documents to optimize read performance and avoid expensive joins, trading storage space for query speed.

**High Performance**: Optimized for specific access patterns, often providing faster read/write operations for their intended use cases compared to traditional SQL databases.

**Distributed Architecture**: Built-in support for replication and distribution across geographic regions, providing high availability and fault tolerance.

{% hint style="info" %}

### MongoDB Schema Considerations

While MongoDB is often described as "schema-less," this is misleading. By default MongoDB does not enforce a schema at the database level, but applications typically require consistent document structures to function properly. MongoDB offers optional **schema validation rules** at the collection level to enforce data types and required fields. Additionally, Object-Document Mappers (ODMs) like Mongoose for Node.js provide application-layer schema definitions with type enforcement, defaults, and validators. This gives developers flexibility to iterate quickly while maintaining data consistency. Unlike SQL, adding a field does not require an `ALTER TABLE`-style migration: new documents can carry the field immediately, though your application must handle both old and new document structures during the transition.
{% endhint %}
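
The idea of application-level schema enforcement can be sketched in a few lines of plain Python. This is not MongoDB's validation syntax or the Mongoose API; `USER_SCHEMA`, `validate`, and `display_name` are hypothetical names used only to show required-field checks, type checks, and tolerant reads of old and new document shapes.

```python
# Hypothetical schema: field name -> (expected type, required?)
USER_SCHEMA = {
    "name": (str, True),
    "email": (str, True),
    "age": (int, False),
}


def validate(doc, schema):
    """Return a list of problems; an empty list means the document passes."""
    errors = []
    for field, (expected_type, required) in schema.items():
        if field not in doc:
            if required:
                errors.append(f"missing required field: {field}")
        elif not isinstance(doc[field], expected_type):
            errors.append(f"{field} should be {expected_type.__name__}")
    return errors


def display_name(doc):
    """Tolerate both an old shape (`name`) and a newer one (`first_name`/`last_name`)."""
    if "name" in doc:
        return doc["name"]
    return f"{doc.get('first_name', '')} {doc.get('last_name', '')}".strip()
```

For example, `validate({"name": "John", "age": "30"}, USER_SCHEMA)` reports the missing `email` and the wrongly typed `age`, while `display_name` keeps working for documents written before and after a hypothetical rename of the name fields.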

### NoSQL vs SQL

|                                  | **SQL**                                              | **NoSQL**                                                          |
| -------------------------------- | ---------------------------------------------------- | ------------------------------------------------------------------ |
| **Type**                         | Relational                                           | Non-Relational                                                     |
| **Data Model**                   | Structured data in tables with rows and columns      | Varies: documents, key-value pairs, wide columns, or graphs        |
| **Schema**                       | Static, rigid, predefined schema required            | Dynamic, flexible, schema-less or schema-flexible                  |
| **Scalability**                  | Vertical (scale up - more powerful hardware)         | Horizontal (scale out - more servers)                              |
| **Query Language**               | SQL (standardized)                                   | Database-specific (varied APIs, query languages)                   |
| **Joins**                        | Efficient complex joins across tables                | Limited or no joins; data often denormalized                       |
| **Transactions**                 | ACID (Atomicity, Consistency, Isolation, Durability) | Often BASE (Basically Available, Soft state, Eventual consistency) |
| **Data Integrity**               | Enforced through foreign keys, constraints           | Application-level validation, less strict enforcement              |
| **Suitable for Large Datasets**  | Can handle large data but scaling is challenging     | Designed for massive datasets and high throughput                  |
| **Suitable for Complex Queries** | Excellent for complex analytical queries             | Best for simple queries; complex aggregations can be challenging   |
| **Support & Maturity**           | Mature ecosystem, extensive tooling and expertise    | Growing community, evolving standards                              |
| **Auto Elasticity**              | Scaling often needs planning or maintenance windows  | Many systems add or remove nodes online with little or no downtime |
| **Cost**                         | Open-source engines; licensed enterprise editions    | Many open-source options, pay-as-you-go cloud models               |
| **Use Case**                     | Complex transactions, reporting, analytics           | Real-time apps, content management, IoT, social networks           |

### Advantages and Disadvantages

#### NoSQL Advantages

* **Flexibility**: Adapt to changing requirements without schema migrations
* **Scalability**: Easily scale horizontally across multiple servers
* **Performance**: Optimized for specific access patterns, faster for certain operations
* **High availability**: Built-in replication and distribution
* **Developer-friendly**: JSON-like documents map naturally to objects in programming languages
* **Handle unstructured data**: Store varied data types without rigid structure
* **Cost-effective scaling**: Commodity hardware for horizontal scaling

#### NoSQL Disadvantages

* **Lack of standardization**: Each database has its own query language and API
* **Limited query capabilities**: Complex queries and joins are challenging
* **Eventual consistency**: Data may be temporarily inconsistent across nodes
* **Data redundancy**: Denormalization leads to duplicate data and potential inconsistencies
* **Less mature ecosystem**: Fewer tools, less expertise compared to SQL
* **Weaker transactional guarantees**: Many NoSQL databases trade consistency for availability, and ACID transactions, where offered, are often limited in scope
* **Learning curve**: Different paradigm requires new thinking about data modeling

### Typical Use Cases

#### When to Use NoSQL

**Real-time Web Applications**

* Chat applications (Slack, WhatsApp)
* Collaborative editing tools (Google Docs-like apps)
* Live dashboards and analytics
* **Why**: Fast writes, flexible schema, real-time sync capabilities

**Content Management Systems**

* Blogs, news sites, wikis
* E-commerce product catalogs
* Digital asset management
* **Why**: Flexible document structure, easy to add new content types

**IoT and Sensor Data**

* Smart home devices
* Industrial monitoring
* Vehicle telemetry
* **Why**: High write throughput, time-series data, horizontal scaling

**Social Networks**

* User profiles and feeds
* Activity streams
* Friend graphs
* **Why**: Flexible user data, graph relationships, massive scale

**Mobile Applications**

* Offline-first apps
* Real-time synchronization
* User-generated content
* **Why**: JSON documents match mobile data structures, built-in sync

**Gaming**

* Player profiles and statistics
* Leaderboards
* Game state storage
* **Why**: Fast reads/writes, flexible data models, scalability

**Big Data and Analytics**

* Log aggregation
* Event tracking
* Clickstream analysis
* **Why**: Handle massive volumes, time-series data, parallel processing

#### When to Use SQL

**Financial Systems**: Banking, accounting, invoicing (ACID compliance critical)

**Enterprise Resource Planning (ERP)**: Complex business logic with many relationships

**Traditional E-commerce**: Inventory management, order processing with complex transactions

**Reporting and Business Intelligence**: Complex analytical queries, aggregations, joins

**Data Warehousing**: Historical data analysis, complex reporting requirements

### Data Modeling: Normalization vs Denormalization

#### SQL Normalization Forms

SQL databases use normalization to reduce data redundancy and maintain data integrity. The normalization process organizes data into tables according to specific rules called normal forms.

**First Normal Form (1NF)**

* Eliminate repeating groups
* Each column contains atomic (indivisible) values
* Each record is unique

**Example violation:** A table with a column containing multiple phone numbers separated by commas.
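
A small sketch of the fix for that violation: the comma-separated column is split so that each row holds exactly one phone number. The `customers` data and the `to_first_normal_form` helper are invented for this example.

```python
def to_first_normal_form(rows):
    """Split a comma-separated `phones` column into one row per phone number."""
    normalized = []
    for row in rows:
        for phone in row["phones"].split(","):
            phone = phone.strip()
            if phone:
                normalized.append({"customer_id": row["customer_id"], "phone": phone})
    return normalized


customers = [
    {"customer_id": 1, "phones": "555-0100, 555-0101"},
    {"customer_id": 2, "phones": "555-0199"},
]
customer_phones = to_first_normal_form(customers)
```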

**Second Normal Form (2NF)**

* Must be in 1NF
* All non-key attributes must depend on the entire primary key
* Eliminate partial dependencies

**Example violation:** In an order details table with composite key (OrderID, ProductID), storing ProductName violates 2NF because it depends only on ProductID, not on the whole key.
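
The decomposition that removes this partial dependency can be sketched as follows: the product name moves into its own lookup keyed by product ID, and the order details keep only attributes that depend on the full composite key. The sample rows and the `split_partial_dependency` helper are assumptions for illustration.

```python
order_details = [
    {"order_id": 1, "product_id": 10, "product_name": "Laptop", "quantity": 1},
    {"order_id": 1, "product_id": 11, "product_name": "Mouse", "quantity": 2},
    {"order_id": 2, "product_id": 10, "product_name": "Laptop", "quantity": 3},
]


def split_partial_dependency(rows):
    """Separate product data (depends on product_id only) from order lines."""
    products = {}
    details = []
    for row in rows:
        products[row["product_id"]] = row["product_name"]
        details.append({k: row[k] for k in ("order_id", "product_id", "quantity")})
    return products, details


products, details = split_partial_dependency(order_details)
```

After the split, "Laptop" is stored once instead of once per order line, so renaming a product is a single update.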

**Third Normal Form (3NF)**

* Must be in 2NF
* No transitive dependencies (non-key attributes depending on other non-key attributes)
* All attributes depend directly on the primary key

**Example violation:** Storing both CustomerCity and CustomerCountry in an Orders table when Country can be derived from City.

**Boyce-Codd Normal Form (BCNF)**

* Stricter version of 3NF
* Every determinant must be a candidate key
* Resolves anomalies not handled by 3NF

**Fourth Normal Form (4NF)**

* Must be in BCNF
* No non-trivial multi-valued dependencies other than on a candidate key
* Separate independent many-to-many relationships

**Fifth Normal Form (5NF)**

* Must be in 4NF
* No join dependencies other than those implied by the candidate keys
* Cannot be decomposed into smaller tables without loss of data

#### NoSQL Denormalization Strategy

NoSQL databases often intentionally denormalize data to optimize read performance and avoid complex joins. Common strategies include:

**Embedding**: Store related data within the same document

```json
{
  "user_id": "123",
  "name": "John Doe",
  "orders": [
    {"order_id": "001", "product": "Laptop", "price": 999},
    {"order_id": "002", "product": "Mouse", "price": 29}
  ]
}
```
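
As a sketch of how such an embedded document might be assembled from normalized data, the function below nests each user's orders inside the user record. The input rows and the `embed_orders` helper are hypothetical; the output has the same shape as the JSON document above.

```python
def embed_orders(users, orders):
    """Build one document per user with that user's orders nested inside."""
    docs = {u["user_id"]: {**u, "orders": []} for u in users}
    for order in orders:
        # Drop the foreign key: the nesting itself now expresses the relationship.
        embedded = {k: v for k, v in order.items() if k != "user_id"}
        docs[order["user_id"]]["orders"].append(embedded)
    return list(docs.values())


users = [{"user_id": "123", "name": "John Doe"}]
orders = [
    {"order_id": "001", "user_id": "123", "product": "Laptop", "price": 999},
    {"order_id": "002", "user_id": "123", "product": "Mouse", "price": 29},
]
user_docs = embed_orders(users, orders)
```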

**Duplication**: Repeat data across multiple documents. In this example the order document carries its own copy of the user's info:

```json
{
  "order_id": "001",
  "user": {
    "user_id": "123",
    "name": "John Doe",
    "email": "john@example.com"
  },
  "product": "Laptop"
}
```

**Aggregation**: Pre-compute and store aggregated values

```json
{
  "user_id": "123",
  "total_orders": 45,
  "total_spent": 12450,
  "last_order_date": "2024-10-15"
}
```
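
Pre-computed values only help if they are kept current, so each write has to update them. The sketch below shows one way this could look; the starting `summary`, the order fields, and the `apply_order` helper are assumptions chosen so that the result matches the document above.

```python
def apply_order(summary, order):
    """Update a user's pre-computed summary when a new order arrives."""
    summary["total_orders"] += 1
    summary["total_spent"] += order["price"]
    # ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
    summary["last_order_date"] = max(summary["last_order_date"], order["order_date"])
    return summary


summary = {
    "user_id": "123",
    "total_orders": 44,
    "total_spent": 12421,
    "last_order_date": "2024-09-30",
}
apply_order(summary, {"order_id": "045", "price": 29, "order_date": "2024-10-15"})
```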

**Trade-offs of Denormalization**:

* **Pros**: Faster reads, no joins needed, better scalability
* **Cons**: Data redundancy, update complexity, potential inconsistencies, more storage
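
The "update complexity" cost can be made concrete with a short sketch: when a user's name is duplicated into order documents, a rename has to find and rewrite every copy. The document shapes follow the duplication example above; the `rename_user` helper and sample data are hypothetical.

```python
def rename_user(user_docs, order_docs, user_id, new_name):
    """Propagate a name change to every document holding a copy.

    Returns the number of documents that had to be rewritten.
    """
    touched = 0
    for doc in user_docs:
        if doc["user_id"] == user_id:
            doc["name"] = new_name
            touched += 1
    for doc in order_docs:
        if doc["user"]["user_id"] == user_id:
            doc["user"]["name"] = new_name
            touched += 1
    return touched


user_docs = [{"user_id": "123", "name": "John Doe"}]
order_docs = [
    {"order_id": "001", "user": {"user_id": "123", "name": "John Doe"}},
    {"order_id": "002", "user": {"user_id": "123", "name": "John Doe"}},
    {"order_id": "003", "user": {"user_id": "456", "name": "Jane Smith"}},
]
rewritten = rename_user(user_docs, order_docs, "123", "John A. Doe")
```

In a normalized design the same rename would touch one row. Here it touches three documents, and any copy that is missed becomes an inconsistency.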

{% embed url="<https://www.youtube.com/watch?v=W2Z7fbCLSTw>" %}

### Practical Examples: SQL vs NoSQL

#### 1. SQL Sample

A relational database (MySQL, PostgreSQL, etc.) uses structured tables with predefined columns. Here's an example of a "Users" table with user-related data:

**Table: Users**

| user\_id | first\_name | last\_name | email                  | date\_of\_birth | city        |
| -------- | ----------- | ---------- | ---------------------- | --------------- | ----------- |
| 1        | John        | Doe        | <john.doe@gmail.com>   | 1990-05-10      | New York    |
| 2        | Jane        | Smith      | <jane.smith@yahoo.com> | 1988-07-22      | Los Angeles |
| 3        | Mark        | Johnson    | <mark.j@outlook.com>   | 1995-03-15      | Chicago     |

**Table: Orders**

| order\_id | user\_id | product\_name | quantity | order\_date |
| --------- | -------- | ------------- | -------- | ----------- |
| 101       | 1        | Laptop        | 1        | 2023-10-20  |
| 102       | 2        | Smartphone    | 2        | 2023-10-22  |
| 103       | 1        | Headphones    | 1        | 2023-10-23  |

**Explanation:**

* Data is normalized across multiple tables
* The foreign key `user_id` in Orders references Users and establishes the relationship
* Queries use JOINs to combine related data
* Schema is rigid and must be defined upfront
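
The two tables can be tried out with Python's built-in `sqlite3` module. This is a sketch using an in-memory SQLite database rather than MySQL or PostgreSQL, with the columns trimmed to keep it short; the lowercase table names and the variable names are choices made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE users (
        user_id    INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL,
        city       TEXT
    );
    CREATE TABLE orders (
        order_id     INTEGER PRIMARY KEY,
        user_id      INTEGER NOT NULL REFERENCES users(user_id),
        product_name TEXT NOT NULL,
        quantity     INTEGER NOT NULL,
        order_date   TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", [
    (1, "John", "Doe", "New York"),
    (2, "Jane", "Smith", "Los Angeles"),
    (3, "Mark", "Johnson", "Chicago"),
])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", [
    (101, 1, "Laptop", 1, "2023-10-20"),
    (102, 2, "Smartphone", 2, "2023-10-22"),
    (103, 1, "Headphones", 1, "2023-10-23"),
])
conn.commit()

# A JOIN combines the two tables through the user_id foreign key.
john_orders = conn.execute("""
    SELECT u.first_name, o.product_name, o.quantity
    FROM orders AS o
    JOIN users AS u ON u.user_id = o.user_id
    WHERE u.user_id = ?
    ORDER BY o.order_date
""", (1,)).fetchall()

# The foreign key rejects an order that points at a user who does not exist.
try:
    conn.execute("INSERT INTO orders VALUES (104, 99, 'Tablet', 1, '2023-10-24')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```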

#### 2. Firebase Firestore Sample

Cloud Firestore (part of Firebase) is a NoSQL document database where data is stored in documents organized into collections, with a flexible schema.

**Collection: Users**

```json
{
  "user_1": {
    "first_name": "John",
    "last_name": "Doe",
    "email": "john.doe@gmail.com",
    "date_of_birth": "1990-05-10",
    "city": "New York",
    "orders": [
      {
        "order_id": "101",
        "product_name": "Laptop",
        "quantity": 1,
        "order_date": "2023-10-20"
      },
      {
        "order_id": "103",
        "product_name": "Headphones",
        "quantity": 1,
        "order_date": "2023-10-23"
      }
    ]
  },
  "user_2": {
    "first_name": "Jane",
    "last_name": "Smith",
    "email": "jane.smith@yahoo.com",
    "date_of_birth": "1988-07-22",
    "city": "Los Angeles",
    "orders": [
      {
        "order_id": "102",
        "product_name": "Smartphone",
        "quantity": 2,
        "order_date": "2023-10-22"
      }
    ]
  }
}
```

**Explanation:**

* Data is denormalized and stored hierarchically
* Orders are embedded within user documents
* No foreign keys or joins needed
* Schema is flexible - different users can have different fields
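
The access pattern this layout favors can be sketched with plain Python dictionaries standing in for the collection. This does not use the Firestore client library; the trimmed `users` data and the helpers `orders_for` and `buyers_of` are assumptions for illustration.

```python
users = {
    "user_1": {"first_name": "John", "city": "New York", "orders": [
        {"order_id": "101", "product_name": "Laptop", "quantity": 1},
        {"order_id": "103", "product_name": "Headphones", "quantity": 1},
    ]},
    "user_2": {"first_name": "Jane", "city": "Los Angeles", "orders": [
        {"order_id": "102", "product_name": "Smartphone", "quantity": 2},
    ]},
}


def orders_for(users, user_id):
    """One lookup returns the user's embedded orders; there is no join step."""
    return users[user_id].get("orders", [])


def buyers_of(users, product_name):
    """A question that cuts across users has to scan every document."""
    return sorted(
        uid for uid, doc in users.items()
        if any(o["product_name"] == product_name for o in doc.get("orders", []))
    )
```

Reading one user's orders is a single lookup, while "who bought a Smartphone?" walks the whole collection. Embedding optimizes for the first kind of query at the expense of the second.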

#### 3. Complex Unstructured Data Example

An example of data that is awkward to store in a relational schema is **real-time chat messages** with embedded media, metadata, and nested replies.

**Real-Time Chat Messages Data:**

```json
{
  "message_id": "msg_123",
  "sender": {
    "user_id": "user_1",
    "name": "John Doe"
  },
  "content": "Hey, check out this picture!",
  "timestamp": "2024-10-24T14:30:00Z",
  "media": {
    "type": "image",
    "url": "https://example.com/image123.jpg",
    "metadata": {
      "resolution": "1920x1080",
      "size_in_kb": 450
    }
  },
  "reactions": [
    {
      "user_id": "user_2",
      "emoji": "👍",
      "timestamp": "2024-10-24T14:32:00Z"
    },
    {
      "user_id": "user_3",
      "emoji": "😂",
      "timestamp": "2024-10-24T14:35:00Z"
    }
  ],
  "replies": [
    {
      "message_id": "msg_124",
      "sender": {
        "user_id": "user_2",
        "name": "Jane Smith"
      },
      "content": "That's a cool picture!",
      "timestamp": "2024-10-24T14:34:00Z",
      "reactions": []
    }
  ]
}
```

**Why this is difficult in SQL:**

1. **Variable and Nested Structures**: Messages contain deeply nested data (`replies`, `reactions`, `media`) with different schemas. SQL requires predefined schemas and would need multiple tables with complex joins.
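2. **Arrays and Embedded Documents**: Fields like `reactions` and `replies` are arrays of objects. A normalized relational design stores these in separate tables with one row per reaction or reply, although some engines offer array or JSON column types (PostgreSQL, for example) that can hold them inline.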
3. **Dynamic Data**: Some fields like `media` may exist for some messages but not others. SQL's rigid schema would result in many NULL values or complex conditional designs.
4. **Real-Time and Scalability**: Chat applications require high-frequency updates and flexible structures. NoSQL databases handle this efficiently through document-based storage and horizontal scaling.

**How NoSQL Handles It:**

* Entire message stored as single document
* No schema constraints - fields vary between documents
* Nested arrays and objects handled natively
* Real-time updates without complex joins
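
To see what the relational alternative involves, the sketch below flattens one nested message into rows for three hypothetical tables (`messages`, `media`, `reactions`), with replies linked back through a `parent_id` column. The table layout and the `shred_message` helper are assumptions for this example, and the sample message is a trimmed version of the one above.

```python
def shred_message(message, parent_id=None, tables=None):
    """Flatten one nested chat message into rows for three hypothetical tables."""
    if tables is None:
        tables = {"messages": [], "media": [], "reactions": []}
    msg_id = message["message_id"]
    tables["messages"].append({
        "message_id": msg_id,
        "parent_id": parent_id,  # None for top-level messages
        "sender_id": message["sender"]["user_id"],
        "content": message["content"],
    })
    media = message.get("media")
    if media:  # optional field: only some messages produce a media row
        tables["media"].append({"message_id": msg_id, "type": media["type"], "url": media["url"]})
    for reaction in message.get("reactions", []):
        tables["reactions"].append({"message_id": msg_id, **reaction})
    for reply in message.get("replies", []):
        shred_message(reply, parent_id=msg_id, tables=tables)
    return tables


message = {
    "message_id": "msg_123",
    "sender": {"user_id": "user_1", "name": "John Doe"},
    "content": "Hey, check out this picture!",
    "media": {"type": "image", "url": "https://example.com/image123.jpg"},
    "reactions": [
        {"user_id": "user_2", "emoji": "👍"},
        {"user_id": "user_3", "emoji": "😂"},
    ],
    "replies": [{
        "message_id": "msg_124",
        "sender": {"user_id": "user_2", "name": "Jane Smith"},
        "content": "That's a cool picture!",
        "reactions": [],
    }],
}
tables = shred_message(message)
```

One document becomes five rows across three tables, and reading the message back means joining them again.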

### Graph Databases: Neo4j

Neo4j is a graph database where data is stored as nodes and relationships instead of tables. It's particularly effective for handling connected data like social networks, recommendations, or complex hierarchies.

#### Key Concepts

* **Nodes**: Represent entities (like people, products) with labels and properties
* **Relationships**: Connect nodes, have types and directions, can contain properties
* **Properties**: Key-value pairs stored on both nodes and relationships
* **Labels**: Categories for nodes (like `:Person`, `:Product`)

#### Cypher Query Language

Cypher is Neo4j's query language, designed to be visual and intuitive:

```cypher
// Statements are separated by semicolons so they can be run one after another.

// Find John's friends
MATCH (p:Person {name: 'John'})-[:FRIENDS_WITH]->(friend)
RETURN friend.name;

// Create a new person and relationship
CREATE (john:Person {name: 'John', age: 30})
  -[:WORKS_AT]->
  (company:Company {name: 'Acme Corp'});

// Find friends of friends (exactly two FRIENDS_WITH hops away)
MATCH (p:Person {name: 'John'})-[:FRIENDS_WITH*2]->(fof)
RETURN DISTINCT fof.name;

// Recommend products based on purchase patterns:
// products bought by people who share a purchase with John, but not yet bought by John
MATCH (user:Person {name: 'John'})-[:PURCHASED]->(product:Product)
      <-[:PURCHASED]-(other:Person)-[:PURCHASED]->(recommendation:Product)
WHERE NOT (user)-[:PURCHASED]->(recommendation)
RETURN recommendation.name, COUNT(*) AS score
ORDER BY score DESC;
```
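
For readers less familiar with graph traversal, the following Python sketch mimics what the friends-of-friends and recommendation queries above compute, using plain dictionaries instead of Neo4j. The sample data and the function names `friends_of_friends` and `recommend` are invented for illustration, and the scoring assumes at most one `PURCHASED` relationship per person and product.

```python
from collections import Counter

# Directed FRIENDS_WITH edges: person -> set of friends.
friends = {
    "John": {"Jane", "Mark"},
    "Jane": {"Alice"},
    "Mark": {"Alice", "Bob"},
    "Alice": set(),
    "Bob": set(),
}

# PURCHASED edges: person -> set of products.
purchases = {
    "John": {"Laptop", "Mouse"},
    "Jane": {"Laptop", "Mouse", "Keyboard"},
    "Mark": {"Laptop", "Monitor", "Keyboard"},
}


def friends_of_friends(graph, start):
    """Everyone reachable in exactly two hops, de-duplicated like RETURN DISTINCT."""
    return {fof for friend in graph.get(start, ()) for fof in graph.get(friend, ())}


def recommend(purchases, user):
    """Score products the user lacks by the number of paths through shared purchases."""
    owned = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        shared = owned & items
        for item in items - owned:
            scores[item] += len(shared)  # one path per shared product
    return scores.most_common()
```

With this data, `friends_of_friends(friends, "John")` yields Alice and Bob, and `recommend(purchases, "John")` ranks Keyboard (score 3) ahead of Monitor (score 1). A graph database performs the same kind of hop-by-hop expansion, but over stored relationships rather than in-memory sets.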

#### Neo4j Use Cases

**Fraud Detection & Security**

* Banking: Pattern detection for suspicious transactions, money laundering networks
* Cybersecurity: Network analysis, threat detection, attack path analysis
* Insurance: Claims fraud investigation by connecting entities and events

**Recommendations**

* E-commerce: Product recommendations based on purchase and browsing patterns
* Social Media: Friend suggestions, content recommendations (LinkedIn connections)
* Entertainment: Content recommendations (Netflix, Spotify recommendation engines)

**Knowledge Graphs**

* Scientific Research: Connecting research papers, authors, citations, and topics
* Enterprise Knowledge Management: Company documentation, expertise location
* AI/ML: Knowledge bases for natural language processing and reasoning

**Identity & Access Management**

* Role-based access control with complex permission hierarchies
* Complex organizational structures and reporting lines
* Access dependency tracking and audit trails

**Network and IT Operations**

* Data center topology and dependency mapping
* Impact analysis for infrastructure changes
* Root cause analysis for outages

The common thread is that Neo4j excels when dealing with highly connected data where relationships are as important as the data points themselves, and where traversing those relationships is a primary operation.

### Choosing Between SQL and NoSQL

**Choose SQL when:**

* Data structure is well-defined and stable
* ACID compliance is critical (financial transactions)
* Complex queries and joins are frequent
* Data integrity and relationships are paramount
* Vertical scaling is acceptable
* Need mature ecosystem and standardization

**Choose NoSQL when:**

* Schema flexibility is required
* Horizontal scalability is essential
* High write/read throughput needed
* Data is unstructured or semi-structured
* Rapid development and iteration required
* Eventually consistent data is acceptable
* Need built-in distribution and replication

**Polyglot Persistence**: Many modern applications use both SQL and NoSQL databases, choosing the right tool for each specific data model and access pattern. For example, an application might use SQL for transactional data and NoSQL for user sessions, caching, or real-time features.
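
A minimal sketch of that split, assuming an in-memory SQLite database for orders and a dictionary-backed `SessionStore` standing in for a key-value store. Both stores, the table layout, and the `checkout` function are hypothetical and chosen only to keep the example self-contained.

```python
import sqlite3


class SessionStore:
    """Stand-in for a key-value store holding short-lived session data."""

    def __init__(self):
        self._data = {}

    def put(self, session_id, value):
        self._data[session_id] = value

    def get(self, session_id):
        return self._data.get(session_id)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
sessions = SessionStore()


def checkout(session_id, total):
    """Look up the session in the key-value store, then record the order in SQL."""
    session = sessions.get(session_id)
    if session is None:
        raise PermissionError("unknown session")
    with db:  # commits on success, rolls back if an exception is raised
        cur = db.execute(
            "INSERT INTO orders (user_id, total) VALUES (?, ?)",
            (session["user_id"], total),
        )
    return cur.lastrowid


sessions.put("abc", {"user_id": 1})
order_id = checkout("abc", 999.0)
```

Losing a session costs the user a login, so a fast store with weaker guarantees is acceptable there, while the order itself goes through a transaction.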
