The Spark MongoDB connector may not fetch all the fields present in the BSON documents stored in a collection.
This is because a Mongo collection can contain documents with different schemas. Typically, all documents in a collection serve a similar or related purpose. A document is a set of key-value pairs, and documents have dynamic schemas: documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.
So, when we read a MongoDB collection using the Spark connector, it infers the schema from the first rows it samples, which may not contain fields that are present in subsequent rows.
Suppose we have a MongoDB collection, default.fruits, with the following documents:
{ "_id" : 1, "type" : "apple"}
{ "_id" : 2, "type" : "orange", "qty" : 10 }
{ "_id" : 3, "type" : "banana" }
Code to connect to and read the Mongo collection:
Note: we are using Spark 2.4, Scala 2.11, and mongo-spark-connector_2.11:2.3.5.
- Execute the command below to launch spark-shell with the connector:
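A typical way to do this is to let spark-shell resolve the connector from Maven with --packages; the connection URI below (a local MongoDB on the default port) is an assumption.

spark-shell \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.5 \
  --conf "spark.mongodb.input.uri=mongodb://127.0.0.1:27017/default.fruits"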
- Or, download the jars and then execute:
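Alternatively, a sketch assuming the connector jar and its MongoDB Java driver dependency have already been downloaded to the current directory (the file names and driver version are illustrative):

spark-shell \
  --jars mongo-spark-connector_2.11-2.3.5.jar,mongo-java-driver-3.9.1.jar \
  --conf "spark.mongodb.input.uri=mongodb://127.0.0.1:27017/default.fruits"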
- Execute the following to read the Mongo collection default.fruits:
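A minimal sketch of the read in spark-shell, assuming a local MongoDB at mongodb://127.0.0.1:27017; the variable names are our own:

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

// Point the connector at the default.fruits collection (URI is an assumption)
val readConfig = ReadConfig(Map(
  "uri"        -> "mongodb://127.0.0.1:27017",
  "database"   -> "default",
  "collection" -> "fruits"
))

// Load the collection as a DataFrame; the schema is inferred by sampling documents
val fruitsDF = MongoSpark.load(spark, readConfig)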
- Check the schema:
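The inferred schema can be inspected with printSchema(); the output below is illustrative (the exact numeric types depend on how the documents were inserted) and, per the scenario above, it has no qty column:

fruitsDF.printSchema()
// root
//  |-- _id: double (nullable = true)
//  |-- type: string (nullable = true)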
- Print the data:
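Illustrative output; the qty value stored for _id = 2 does not appear because the column is not part of the inferred schema:

fruitsDF.show()
// +---+------+
// |_id|  type|
// +---+------+
// |1.0| apple|
// |2.0|orange|
// |3.0|banana|
// +---+------+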
Note: we missed the "qty" field for "_id" = 2.
The solution to the above problem is to specify the schema externally, rather than allowing Spark to infer it from the data.
- Pick one document in JSON string format with the minimal schema that is needed and save it in a file. For example, create a file named "sample.json" containing the following:
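For instance, the document for _id = 2 carries all three fields we care about:

{ "_id": 2, "type": "orange", "qty": 10 }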
- Read the JSON file:
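A short sketch; the path to sample.json (local working directory or HDFS, depending on your setup) is an assumption:

// Read the one-document JSON file and keep its schema
val sampleDF = spark.read.json("sample.json")
val fruitsSchema = sampleDF.schema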
- Read from the MongoDB collection specifying the schema, like below:
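A sketch using the DataFrameReader API with the external schema; the URI, database, and collection values mirror the assumptions above:

// Pass the schema explicitly instead of letting the connector infer one
val fruitsDF2 = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .schema(fruitsSchema)
  .option("uri", "mongodb://127.0.0.1:27017")
  .option("database", "default")
  .option("collection", "fruits")
  .load()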
- Check the schema:
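Illustrative output; spark.read.json infers JSON integers as long, so the sample-derived schema looks like this:

fruitsDF2.printSchema()
// root
//  |-- _id: long (nullable = true)
//  |-- qty: long (nullable = true)
//  |-- type: string (nullable = true)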
- Print the data:
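Illustrative output, assuming the stored numeric values convert cleanly to the long types in the sample schema; qty now appears, with null for documents that lack it:

fruitsDF2.show()
// +---+----+------+
// |_id| qty|  type|
// +---+----+------+
// |  1|null| apple|
// |  2|  10|orange|
// |  3|null|banana|
// +---+----+------+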