How to use Elasticsearch

Connect to the server with a read-only account

from elasticsearch import Elasticsearch
es = Elasticsearch(
         cloud_id="TODO",
         api_key="TODO",
         retry_on_timeout=True,
         http_compress=True)

Find out which indices exist

indices = [name for name in es.indices.get("*").keys() if not name.startswith(".")]

# Or, with newer versions of the Python client, which require keyword arguments

indices = [name for name in es.indices.get(index="*").keys() if not name.startswith(".")]

At the moment, this will give the following indices:

  • c4: All of C4, including the "en", "en.noblocklist", and "en.noclean" subsets.
  • laion1b-nolang: Portion of LAION that doesn't have a detected language
  • laion2b-multi-1: Part one of LAION with a detected language other than English
  • laion2b-multi-2: Part two of LAION with a detected language other than English
  • metrics-endpoint.metadata_current_default: Internal ES stuff
  • openwebtext: Is what it says.
  • re_laion2b-en-1: Part one of LAION with a detected language of English
  • re_laion2b-en-2: Part two of LAION with a detected language of English
  • re_oscar: OSCAR.
  • re_pile: The Pile.
  • s2orc-abstracts: Abstracts from S2ORC. The URLs in this are Semantic Scholar URLs generated from the paper IDs.
  • search-test: Test index that's empty. I keep this around to look at the default mappings from time to time.

Search over one index

This searches for the word "water" in the OSCAR dataset.

es.search(index="re_oscar", body={
    "query": {
        "match": {
            "text": "water"
        }
     }
})
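
The call returns a standard Elasticsearch search response, with the matching documents listed under "hits" → "hits". Here is a minimal sketch for pulling the text out of each hit, assuming the documents keep their content in the "text" field that the query above matches on. The helper name hit_texts and the sample response are hypothetical and only there for illustration:

```python
def hit_texts(response):
    """Return the "text" field of every hit in a search response."""
    return [hit["_source"]["text"] for hit in response["hits"]["hits"]]


# Hand-written stand-in for a real response, trimmed to the fields used here.
sample_response = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_index": "re_oscar", "_id": "a", "_score": 1.2,
             "_source": {"text": "water is wet"}},
            {"_index": "re_oscar", "_id": "b", "_score": 0.9,
             "_source": {"text": "salt water"}},
        ],
    }
}

print(hit_texts(sample_response))
```

With a live connection, passing the result of es.search(...) to hit_texts should work the same way, since the client's response supports dictionary-style access.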

Search over multiple indices

Because LAION has more documents than fit into a single Elasticsearch index, it is split over multiple indices. Fortunately, you can query more than one index at a time, either with a wildcard or with a comma-separated list of index names. Here is an example that searches for the word "water" in all of LAION-2B-en:

es.search(index="re_laion2b-en-*", body={
    "query": {
        "match": {
            "text": "water"
        }
    }
})
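
When a query spans several indices, every hit carries an "_index" field naming the index it came from. Here is a small sketch that tallies hits per index. The helper name hits_per_index and the sample response are hypothetical:

```python
from collections import Counter


def hits_per_index(response):
    """Count how many of the returned hits came from each index."""
    return Counter(hit["_index"] for hit in response["hits"]["hits"])


# Hand-written stand-in for a response to a multi-index query.
sample_response = {
    "hits": {
        "hits": [
            {"_index": "re_laion2b-en-1", "_id": "1", "_source": {"text": "water"}},
            {"_index": "re_laion2b-en-2", "_id": "2", "_source": {"text": "water lily"}},
            {"_index": "re_laion2b-en-1", "_id": "3", "_source": {"text": "waterfall"}},
        ]
    }
}

counts = hits_per_index(sample_response)
print(dict(counts))
```

Note that this only counts the hits actually returned in the response, not the total number of matches in each index.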

Search over C4 subsets

Because C4 is so big, and the subsets have considerable overlap, we didn't want to make one index per subset. Instead, every document has a "subset" field that can be queried. Here is how you search for "water" only in the "en" subset:

es.search(index="c4", body={
    "query": {
        "bool": {
            "must": {
                "match": {"text": "water"}
             },
             "filter": {
                 "term": {"subset": "en"}
             }
         }
     }
})
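
If you query several subsets, it can help to build the request body in one place. Here is a sketch, assuming the "subset" field holds the subset names exactly as listed for the c4 index ("en", "en.noblocklist", "en.noclean"). The helper name c4_subset_query is hypothetical:

```python
def c4_subset_query(text, subset):
    """Build a search body matching `text`, restricted to one C4 subset."""
    return {
        "query": {
            "bool": {
                # "must" scores documents by how well they match the text.
                "must": {"match": {"text": text}},
                # "filter" restricts the results without affecting the score.
                "filter": {"term": {"subset": subset}},
            }
        }
    }


body = c4_subset_query("water", "en")
print(body)
```

The result can be passed to the search call in the same way as the literal body above, for example es.search(index="c4", body=body).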

Getting documents when you already have a document ID

When you already have a document ID, you don't need to search. Retrieving the document directly is much faster:

es.get(index="re_laion2b-en-2", id=doc_id)

Document IDs are usually derived from the source data, if the source data has a reasonable ID that we can use. If they can't be derived from the source data, they are derived from the JSON file and line number that the document came from.

  • LAION has an "ID" field in the source data, but it's not a unique identifier. Many LAION documents share the same ID, so I didn't use that field.
  • C4 document IDs are hashes of the URL. See below.

C4 document IDs are hashes of the URL

I wish we could just use the URLs as document IDs, but they are too long. So instead, we use a hash of the URL as the document ID. This is how you turn a URL into a document ID:

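import hashlib
# url is the document's URL as a string
encoded_url = url.strip().encode("UTF8")
# A blake2b hex digest is 128 characters long, so the [:512] slice never truncates it
doc_id = hashlib.blake2b(encoded_url).hexdigest()[:512]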
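
The same recipe can be wrapped in a function, as a sketch. The helper name url_to_doc_id is hypothetical and the example URL is made up. Because the URL is stripped before hashing, surrounding whitespace and trailing newlines do not change the resulting ID:

```python
import hashlib


def url_to_doc_id(url):
    """Hash a URL into a C4 document ID, following the recipe above."""
    encoded_url = url.strip().encode("UTF8")
    return hashlib.blake2b(encoded_url).hexdigest()[:512]


# Made-up URL with a trailing newline, as it might appear in a text file.
doc_id = url_to_doc_id("https://example.com/some/page\n")
print(len(doc_id))  # 128 hex characters
```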

Access random documents

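import random

import h5py

# ids.h5 holds the document IDs in a dataset named "ids".
with h5py.File("ids.h5", "r") as f:
    ds = f["ids"]
    # Pick one ID at random; h5py returns it as bytes, so decode it.
    random_doc_id = random.choice(ds).decode()

# Index name as it appears in the index list above; this assumes the IDs
# in ids.h5 belong to that index.
es.get(index="re_laion2b-en-2", id=random_doc_id)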